Re: [PATCH 3/5] dt-bindings: arm: Document Socionext MB86S71 and Fujitsu F-Cue

2017-11-06 Thread Yang Zhang


> On 6 Nov 2017, at 06:58, Andreas Färber  wrote:
> 
>> Am 05.11.2017 um 04:39 schrieb Ard Biesheuvel:
>>> On 4 November 2017 at 20:06, Andreas Färber  wrote:
 Am 04.11.2017 um 23:39 schrieb Ard Biesheuvel:
> On 4 November 2017 at 15:30, Andreas Färber  wrote:
>> Am 04.11.2017 um 22:57 schrieb Ard Biesheuvel:
>>> On 4 November 2017 at 13:44, Andreas Färber  wrote:
>>> Hi everyone,
>>> 
>>> The non-building clk driver has been removed for 4.14, but this patchset
>>> seems stuck on matters of naming and maintenance...
>>> 
 Am 30.06.2017 um 01:18 schrieb Masahiro Yamada:
 Hi Andreas,
 
 2017-06-29 21:53 GMT+09:00 Andreas Färber :
> Hi Masahiro-san,
> 
>> Am 29.06.2017 um 14:18 schrieb Masahiro Yamada:
>> 2017-06-29 1:46 GMT+09:00 Rob Herring :
 On Sun, Jun 25, 2017 at 07:00:18PM +0200, Andreas Färber wrote:
 For consistency with existing SoC bindings, use "fujitsu,mb86s71" but
 document it in socionext.txt.
 
 Signed-off-by: Andreas Färber 
 ---
 Documentation/devicetree/bindings/arm/socionext.txt | 17 +
 1 file changed, 17 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/socionext.txt
>>> 
>>> Acked-by: Rob Herring 
>>> --
>> 
>> I do not mind this, but
>> please note there are multiple product lines in Socionext
>> because Socionext merged LSI divisions from Panasonic and Fujitsu.
>> 
>> I maintain documents for Socionext UniPhier SoC family
>> (which inherits SoC architecture of Panasonic)
>> in Documentation/devicetree/bindings/arm/uniphier/.
> 
> Actually you seemed to be lacking bindings beyond the cache controller
> for Uniphier:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/devicetree/bindings/arm/uniphier
> 
> The SoC compatible, e.g. "socionext,uniphier-ld11", needs to be 
> defined
> somewhere too, as done here. A git-grep for that particular compatible
> only finds derived clk and reset bindings.
 
 I can take care of sending a patch later, but it is off-topic here.
>>> 
>>> [The relevance was that had there been any bindings precedence from
>>> UniPhier, it would've influenced my naming choices.]
>>> 
> Using socionext.txt allows you to add those bindings to a shared file;
> if you prefer to host them separately below uniphier/ or as 
> uniphier.txt
 
 I was thinking of this way.
 
 For example, TI has omap/, keystone/, davinci.txt, etc.
 in this directory level.
 
> do you have a better name suggestion for this one? I was trying to 
> keep
> our options open to later add SC2A11 in socionext.txt, and I also saw
> some mb8ac300 or so (MB86S7x predecessor?) in downstream sources, so I
> don't know a good common name for the non-Panasonic parts. And if we
> take fujitsu.txt for MB86S7x to match the vendor prefix then we will
> need a separate file for the new SC2A11 IIUC.
 
 I have no idea.
 Actually, I am not familiar with those SoCs.
 
 I am not sure if there exists a common name for those Fujitsu-derived 
 SoCs.
 I think a SoC family name will be helpful to avoid proliferating
 arch/arm/mach-{mb86s7x,mb8ac300, ...}.
 
 I see some Socionext guys CC'ed in this mail,
 somebody might have information about this.
 
 As I said before, I do not mind adding socionext.txt
 and it seems reasonable enough
 if there is no common name for those SoCs.
 
> Also if you can tell us where the cut between Fujitsu and Socionext
> should be done, we can certainly adapt. NXP is still adding all their
> new SoCs in fsl.txt, it seems.
> (A similar naming issue exists for my not-yet-submitted FM4 patches,
> where it changed owners from Fujitsu to Spansion and then to Cypress.)
 
 Right, vendor names are not future-proof in some cases.
 
 We use "uniphier" because it is convenient to
 make a group of SoCs with similar architecture,
 and it will work even if UniPhier product lines are sold again in the
 future.  :-)
>>> 
>>> Summarizing: Masahiro-san only wants to maintain the UniPhier family of
>>> Socionext SoCs, not this MB86S71. No one from Socionext or Linaro has
>>> volunteered as maintainer for these F-Cue MB86S71 patches - that seems
>>> to indicate I'll again have to set up a new repository and start
>>> maintaining it myself.
>>> 

Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support

2017-09-13 Thread Yang Zhang

On 2017/8/29 22:56, Michael S. Tsirkin wrote:

On Tue, Aug 29, 2017 at 11:46:34AM +, Yang Zhang wrote:

Some latency-intensive workloads will see an obvious performance
drop when running inside a VM.


But are we trading a lot of CPU for a bit of lower latency?


The main reason is that the overhead
is amplified when running inside a VM. The biggest cost I have seen is
in the idle path.

This patch introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, then
we don't need to go through the heavy overhead path.


Isn't it the job of an idle driver to find the best way to
halt the CPU?

It looks like just by adding a cstate we can make it
halt at higher latencies only. And at lower latencies,
if it's doing a good job we can hopefully use mwait to
stop the CPU.

In fact I have been experimenting with exactly that.
Some initial results are encouraging but I could use help
with testing and especially tuning. If you can help
pls let me know!


Quan, can you help test it and report the results? Thanks.

--
Yang
Alibaba Cloud Computing


Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support

2017-09-01 Thread Yang Zhang

On 2017/9/1 14:58, Wanpeng Li wrote:

2017-09-01 14:44 GMT+08:00 Yang Zhang <yang.zhang...@gmail.com>:

On 2017/8/29 22:02, Wanpeng Li wrote:


Here is the data we get when running benchmark netperf:

 2. w/ patch:
halt_poll_threshold=1 -- 15803.89 bits/s -- 159.5 %CPU
halt_poll_threshold=2 -- 15899.04 bits/s -- 161.5 %CPU
halt_poll_threshold=3 -- 15642.38 bits/s -- 161.8 %CPU
halt_poll_threshold=4 -- 18040.76 bits/s -- 184.0 %CPU
halt_poll_threshold=5 -- 18877.61 bits/s -- 197.3 %CPU

 3. kvm dynamic poll
halt_poll_ns=1 -- 15876.00 bits/s -- 172.2 %CPU
halt_poll_ns=2 -- 15602.58 bits/s -- 185.4 %CPU
halt_poll_ns=3 -- 15930.69 bits/s -- 194.4 %CPU
halt_poll_ns=4 -- 16413.09 bits/s -- 195.3 %CPU
halt_poll_ns=5 -- 16417.42 bits/s -- 196.3 %CPU



Actually I'm not sure how much sense it makes to introduce this pv
stuff and the duplicate adaptive halt-polling logic as what has
already been done in kvm w/o obvious benefit for real workload like
netperf. In addition, as you mentioned offline to me, enable both the
patchset and the adaptive halt-polling logic in kvm simultaneously can
result in more cpu power consumption. I remembered that David from



No. If we poll on the KVM side, it will consume more CPU than polling on the
guest side. If we use both, then we get the same performance as enabling only
guest-side polling, but at a higher CPU cost because of the KVM-side polling.
It means we should disable the KVM-side poll, since it cannot give much
improvement over the guest side and only consumes more CPU.


The customers should have enough knowledge about what's the meaning of
the tunning which you exposed.


We have applied this patch to a customized kernel for some real customers 
and we got positive feedback from them, since the CPU never runs at 100% 
even when there is no task running. Also, this helps them give more CPUs 
to other tasks and reduces the power consumption in their rack. Don't you 
think that is better?




Regards,
Wanpeng Li




Google mentioned that Windows Event Objects can get 2x latency
improvement in KVM FORUM, which means that the adaptive halt-polling
in kvm should be enabled by default. So if the windows guests and
linux guests are mixed on the same host, then this patchset will
result in more cpu power consumption if the customer enable the
polling in the linux guest. Anyway, if the patchset is finally



As I said last time, there are already users running idle=poll in their VMs;
you *cannot* prevent them from doing it. This patch provides a better solution
than unconditional polling; we didn't introduce anything worse.


acceptable by maintainer, I will introduce the generic adaptive
halt-polling framework in kvm to avoid the duplicate logic.



We will add more conditions beyond the current algorithm in the future. But it's
ok to share one implementation for now; we will do it in the next version.



Regards,
Wanpeng Li




--
Yang
Alibaba Cloud Computing



--
Yang
Alibaba Cloud Computing


Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle

2017-09-01 Thread Yang Zhang

On 2017/8/29 20:46, Peter Zijlstra wrote:

On Tue, Aug 29, 2017 at 11:46:41AM +, Yang Zhang wrote:

ttwu_do_wakeup() updates avg_idle when waking up from idle. Here we just
reuse this logic to update the poll time. It may be a little late to update
the poll in ttwu_do_wakeup(), but the test results show no obvious
performance gap compared with updating the poll in the irq handler.

One problem is that idle_stamp is only used by the CFS scheduler. But that
is ok, since CFS is the default scheduling policy and considering only it
should be enough.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>


Same broken SoB chain, and not a useful word on why you need to adjust
this crap to begin with. What you want that poll duration to be related
to is the cost of a VMEXIT/VMENTER cycle, not however long we happened
to be idle.


Actually, we should compare the cost of VMEXIT/VMENTER with the real 
duration spent in idle. We have a rough number for the cost of one 
VMEXIT/VMENTER (about 2k~4k cycles, depending on the underlying CPU), and 
the idle path introduces 4~5 VMENTER/VMEXIT cycles, which may add about 
7us of latency on average. So we set the poll duration to 10us by default.


Another problem is that there is no good way to measure the duration spent 
in idle; avg_idle is the only way I have found so far. Do you have any 
suggestion for doing it better? Thanks.
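
For illustration only, a minimal sketch of the kind of hook discussed above.
The actual 7/7 hunk is not quoted in this thread, and
paravirt_idle_update_poll_duration() is an assumed wrapper name around
pv_idle_ops.update (which patch 6/7 points at kvm_idle_update_poll_duration()):

	/* Sketch only: when a wakeup ends an idle period, feed the freshly
	 * updated avg_idle into the (assumed) paravirt update hook. */
	static void sched_update_idle_poll(struct rq *rq)
	{
		if (!rq->idle_stamp)
			return;

		/* how long this CPU actually stayed idle, in ns */
		update_avg(&rq->avg_idle, rq_clock(rq) - rq->idle_stamp);

		/* assumed wrapper around pv_idle_ops.update() */
		paravirt_idle_update_poll_duration(rq->avg_idle);

		rq->idle_stamp = 0;
	}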


--
Yang
Alibaba Cloud Computing


Re: [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops

2017-09-01 Thread Yang Zhang

On 2017/8/29 21:55, Konrad Rzeszutek Wilk wrote:

On Tue, Aug 29, 2017 at 11:46:35AM +, Yang Zhang wrote:

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the
idle path and polls for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations,
including timer access (LAPIC timer or TSC deadline timer), which hurt
performance, especially for latency-intensive workloads like message-passing
tasks. The cost mainly comes from the vmexit, which is a hardware context
switch between the VM and the hypervisor. Our solution is to poll for a
while and not enter the real idle path if we get a schedule event during
polling.

Polling may waste CPU, so we adopt a smart polling mechanism to reduce
useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Jeremy Fitzhardinge <jer...@goop.org>
Cc: Chris Wright <chr...@sous-sol.org>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org


Adding xen-devel.

Juergen, we really should replace Jeremy's name with xen-devel or
your name. Wasn't there a patch by you that took over some of the
maintainership for it?


Hi Konrad, I didn't test it on the Xen side since I don't have the 
environment, but I can add it for Xen in the next version if you think it is 
useful for Xen as well.





---
  arch/x86/include/asm/paravirt.h   | 5 +
  arch/x86/include/asm/paravirt_types.h | 6 ++
  arch/x86/kernel/paravirt.c| 6 ++
  3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9ccac19..6d46760 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int 
counter)
  
  #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
  
+static inline void paravirt_idle_poll(void)
+{
+   PVOP_VCALL0(pv_idle_ops.poll);
+}
+
  static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned 
entries)
  {
PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 9ffc36b..cf45726 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -324,6 +324,10 @@ struct pv_lock_ops {
struct paravirt_callee_save vcpu_is_preempted;
  } __no_randomize_layout;
  
+struct pv_idle_ops {
+   void (*poll)(void);
+} __no_randomize_layout;
+
  /* This contains all the paravirt structures: we get a convenient
   * number for each function using the offset which we use to indicate
   * what to patch. */
@@ -334,6 +338,7 @@ struct paravirt_patch_template {
struct pv_irq_ops pv_irq_ops;
struct pv_mmu_ops pv_mmu_ops;
struct pv_lock_ops pv_lock_ops;
+   struct pv_idle_ops pv_idle_ops;
  } __no_randomize_layout;
  
  extern struct pv_info pv_info;

@@ -343,6 +348,7 @@ struct paravirt_patch_template {
  extern struct pv_irq_ops pv_irq_ops;
  extern struct pv_mmu_ops pv_mmu_ops;
  extern struct pv_lock_ops pv_lock_ops;
+extern struct pv_idle_ops pv_idle_ops;
  
  #define PARAVIRT_PATCH(x)	\
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bc0a849..1b5b247 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
  #ifdef CONFIG_PARAVIRT_SPINLOCKS
.pv_lock_ops = pv_lock_ops,
  #endif
+   .pv_idle_ops = pv_idle_ops,
};
return *((void **)&tmpl + type);
  }
@@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
.steal_clock = native_steal_clock,
  };
  
+struct pv_idle_ops pv_idle_ops = {
+   .poll = paravirt_nop,
+};
+
  __visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
@@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
  EXPORT_SYMBOL(pv_mmu_ops);
  EXPORT_SYMBOL_GPL(pv_info);
  EXPORT_SYMBOL(pv_irq_ops);
+EXPORT_SYMBOL(pv_idle_ops);
--
1.8.3.1




--
Yang
Alibaba Cloud Computing


Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support

2017-09-01 Thread Yang Zhang

On 2017/8/29 22:02, Wanpeng Li wrote:

Here is the data we get when running benchmark netperf:

2. w/ patch:
   halt_poll_threshold=1 -- 15803.89 bits/s -- 159.5 %CPU
   halt_poll_threshold=2 -- 15899.04 bits/s -- 161.5 %CPU
   halt_poll_threshold=3 -- 15642.38 bits/s -- 161.8 %CPU
   halt_poll_threshold=4 -- 18040.76 bits/s -- 184.0 %CPU
   halt_poll_threshold=5 -- 18877.61 bits/s -- 197.3 %CPU

3. kvm dynamic poll
   halt_poll_ns=1 -- 15876.00 bits/s -- 172.2 %CPU
   halt_poll_ns=2 -- 15602.58 bits/s -- 185.4 %CPU
   halt_poll_ns=3 -- 15930.69 bits/s -- 194.4 %CPU
   halt_poll_ns=4 -- 16413.09 bits/s -- 195.3 %CPU
   halt_poll_ns=5 -- 16417.42 bits/s -- 196.3 %CPU



Actually I'm not sure how much sense it makes to introduce this pv
stuff and the duplicate adaptive halt-polling logic as what has
already been done in kvm w/o obvious benefit for real workload like
netperf. In addition, as you mentioned offline to me, enable both the
patchset and the adaptive halt-polling logic in kvm simultaneously can
result in more cpu power consumption. I remembered that David from


No. If we poll on the KVM side, it will consume more CPU than polling on the 
guest side. If we use both, then we get the same performance as enabling only 
guest-side polling, but at a higher CPU cost because of the KVM-side polling. 
It means we should disable the KVM-side poll, since it cannot give much 
improvement over the guest side and only consumes more CPU.



Google mentioned that Windows Event Objects can get 2x latency
improvement in KVM FORUM, which means that the adaptive halt-polling
in kvm should be enabled by default. So if the windows guests and
linux guests are mixed on the same host, then this patchset will
result in more cpu power consumption if the customer enable the
polling in the linux guest. Anyway, if the patchset is finally


As I said last time, there are already users running idle=poll in their 
VMs; you *cannot* prevent them from doing it. This patch provides a better 
solution than unconditional polling; we didn't introduce anything worse.



acceptable by maintainer, I will introduce the generic adaptive
halt-polling framework in kvm to avoid the duplicate logic.


We will add more conditions beyond the current algorithm in the future. But 
it's ok to share one implementation for now; we will do it in the next version.




Regards,
Wanpeng Li




--
Yang
Alibaba Cloud Computing


Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support

2017-09-01 Thread Yang Zhang

On 2017/8/29 22:02, Wanpeng Li wrote:

Here is the data we get when running benchmark netperf:

2. w/ patch:
   halt_poll_threshold=1 -- 15803.89 bits/s -- 159.5 %CPU
   halt_poll_threshold=2 -- 15899.04 bits/s -- 161.5 %CPU
   halt_poll_threshold=3 -- 15642.38 bits/s -- 161.8 %CPU
   halt_poll_threshold=4 -- 18040.76 bits/s -- 184.0 %CPU
   halt_poll_threshold=5 -- 18877.61 bits/s -- 197.3 %CPU

3. kvm dynamic poll
   halt_poll_ns=1 -- 15876.00 bits/s -- 172.2 %CPU
   halt_poll_ns=2 -- 15602.58 bits/s -- 185.4 %CPU
   halt_poll_ns=3 -- 15930.69 bits/s -- 194.4 %CPU
   halt_poll_ns=4 -- 16413.09 bits/s -- 195.3 %CPU
   halt_poll_ns=5 -- 16417.42 bits/s -- 196.3 %CPU



Actually I'm not sure how much sense it makes to introduce this pv
stuff and the duplicate adaptive halt-polling logic as what has
already been done in kvm w/o obvious benefit for real workload like netperf. 
In addition, as you mentioned offline to me, enable both the
patchset and the adaptive halt-polling logic in kvm simultaneously can
result in more cpu power consumption. I remembered that David from


If we poll on the KVM side, it will consume more CPU than polling on the guest 
side. If we use both, then we get the same performance as enabling only 
guest-side polling, but at a higher CPU cost because of the KVM-side polling. 
It means we should disable the KVM-side poll, since it cannot give much 
improvement over the guest side and only adds CPU consumption and latency.



Google mentioned that Windows Event Objects can get 2x latency
improvement in KVM FORUM, which means that the adaptive halt-polling
in kvm should be enabled by default. So if the windows guests and
linux guests are mixed on the same host, then this patchset will
result in more cpu power consumption if the customer enable the
polling in the linux guest. Anyway, if the patchset is finally
acceptable by maintainer, I will introduce the generic adaptive
halt-polling framework in kvm to avoid the duplicate logic.


We will add more conditions beyond the current algorithm in the future. But 
it's ok to use the one copy for now; we will do it in the next version.



--
Yang
Alibaba Cloud Computing


Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support

2017-09-01 Thread Yang Zhang

On 2017/8/29 19:58, Alexander Graf wrote:

On 08/29/2017 01:46 PM, Yang Zhang wrote:

Some latency-intensive workloads will see an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM. The biggest cost I have seen is
in the idle path.

This patch introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, then
we don't need to go through the heavy overhead path.

Here is the data we get when running benchmark contextswitch to measure
the latency(lower is better):

    1. w/o patch:
   2493.14 ns/ctxsw -- 200.3 %CPU
    2. w/ patch:
   halt_poll_threshold=1 -- 1485.96ns/ctxsw -- 201.0 %CPU
   halt_poll_threshold=2 -- 1391.26 ns/ctxsw -- 200.7 %CPU
   halt_poll_threshold=3 -- 1488.55 ns/ctxsw -- 200.1 %CPU
   halt_poll_threshold=50 -- 1159.14 ns/ctxsw -- 201.5 %CPU
    3. kvm dynamic poll
   halt_poll_ns=1 -- 2296.11 ns/ctxsw -- 201.2 %CPU
   halt_poll_ns=2 -- 2599.7 ns/ctxsw -- 201.7 %CPU
   halt_poll_ns=3 -- 2588.68 ns/ctxsw -- 211.6 %CPU
   halt_poll_ns=50 -- 2423.20 ns/ctxsw -- 229.2 %CPU
    4. idle=poll
   2050.1 ns/ctxsw -- 1003 %CPU
    5. idle=mwait
   2188.06 ns/ctxsw -- 206.3 %CPU


Could you please try to create another metric for guest initiated, host 
aborted mwait?


For a quick benchmark, reserve 4 registers for a magic value, set them 
to the magic value before you enter MWAIT in the guest. Then allow 
native MWAIT execution on the host. If you see the guest wants to enter 


I guess you mean allowing native MWAIT execution in the guest, not on the host?

with the 4 registers containing the magic contents and no events are 
pending, directly go into the vcpu block function on the host.


Mmm.. it is not very clear to me. If the guest executes MWAIT without a 
vmexit, how can we check the registers?




That way any time a guest gets naturally aborted while in mwait, it will 
only reenter mwait when an event actually occured. While the guest is 
normally running (and nobody else wants to run on the host), we just 
stay in guest context, but with a sleeping CPU.


Overall, that might give us even better performance, as it allows for 
turbo boost and HT to work properly.


In our testing, we have enough cores (32 cores) but only 10 vCPUs, so in 
the best case we may see the same performance as polling.


--
Yang
Alibaba Cloud Computing


[RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops

2017-08-29 Thread Yang Zhang
So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the
idle path and polls for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations,
including timer access (LAPIC timer or TSC deadline timer), which hurt
performance, especially for latency-intensive workloads like message-passing
tasks. The cost mainly comes from the vmexit, which is a hardware context
switch between the VM and the hypervisor. Our solution is to poll for a
while and not enter the real idle path if we get a schedule event during
polling.

Polling may waste CPU, so we adopt a smart polling mechanism to reduce
useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Jeremy Fitzhardinge <jer...@goop.org>
Cc: Chris Wright <chr...@sous-sol.org>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/paravirt.h   | 5 +
 arch/x86/include/asm/paravirt_types.h | 6 ++
 arch/x86/kernel/paravirt.c| 6 ++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9ccac19..6d46760 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int 
counter)
 
 #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
 
+static inline void paravirt_idle_poll(void)
+{
+   PVOP_VCALL0(pv_idle_ops.poll);
+}
+
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned 
entries)
 {
PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 9ffc36b..cf45726 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -324,6 +324,10 @@ struct pv_lock_ops {
struct paravirt_callee_save vcpu_is_preempted;
 } __no_randomize_layout;
 
+struct pv_idle_ops {
+   void (*poll)(void);
+} __no_randomize_layout;
+
 /* This contains all the paravirt structures: we get a convenient
  * number for each function using the offset which we use to indicate
  * what to patch. */
@@ -334,6 +338,7 @@ struct paravirt_patch_template {
struct pv_irq_ops pv_irq_ops;
struct pv_mmu_ops pv_mmu_ops;
struct pv_lock_ops pv_lock_ops;
+   struct pv_idle_ops pv_idle_ops;
 } __no_randomize_layout;
 
 extern struct pv_info pv_info;
@@ -343,6 +348,7 @@ struct paravirt_patch_template {
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
 extern struct pv_lock_ops pv_lock_ops;
+extern struct pv_idle_ops pv_idle_ops;
 
 #define PARAVIRT_PATCH(x)  \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bc0a849..1b5b247 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
.pv_lock_ops = pv_lock_ops,
 #endif
+   .pv_idle_ops = pv_idle_ops,
};
return *((void **)&tmpl + type);
 }
@@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
.steal_clock = native_steal_clock,
 };
 
+struct pv_idle_ops pv_idle_ops = {
+   .poll = paravirt_nop,
+};
+
 __visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
@@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 EXPORT_SYMBOL(pv_mmu_ops);
 EXPORT_SYMBOL_GPL(pv_info);
 EXPORT_SYMBOL(pv_irq_ops);
+EXPORT_SYMBOL(pv_idle_ops);
-- 
1.8.3.1
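
The hunk that actually calls paravirt_idle_poll() from the idle path is not
quoted in this archive. Purely as a hedged sketch of the intended flow (not
the series' actual code), the call site would look roughly like this:

	/* Sketch only: poll first, and fall back to the real (vmexit-heavy)
	 * halt path only if nothing became runnable while we were polling. */
	static void poll_then_halt(void)
	{
		/* nop on bare metal; kvm_idle_poll() once patch 2/7 registers it */
		paravirt_idle_poll();

		if (!need_resched())
			default_idle();		/* sti; hlt */
	}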



[RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops

2017-08-29 Thread Yang Zhang
Although smart idle polling has nothing to do with paravirt, it cannot
bring any benefit to native. So we only enable it when Linux runs as a
KVM guest (it can be extended to other hypervisors like Xen, Hyper-V and
VMware).

Introduce the per-CPU variable poll_duration_ns to control the max poll
time.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/kvm.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d04e30e..7d84a02 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,6 +75,7 @@ static int parse_no_kvmclock_vsyscall(char *arg)
 
 early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
+static DEFINE_PER_CPU(unsigned long, poll_duration_ns);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -357,6 +358,29 @@ static void kvm_guest_cpu_init(void)
kvm_register_steal_time();
 }
 
+static void kvm_idle_poll(void)
+{
+   unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+   ktime_t start, cur, stop;
+
+   start = cur = ktime_get();
+   stop = ktime_add_ns(ktime_get(), poll_duration);
+
+   do {
+   if (need_resched())
+   break;
+   cur = ktime_get();
+   } while (ktime_before(cur, stop));
+}
+
+static void kvm_guest_idle_init(void)
+{
+   if (!kvm_para_available())
+   return;
+
+   pv_idle_ops.poll = kvm_idle_poll;
+}
+
 static void kvm_pv_disable_apf(void)
 {
if (!__this_cpu_read(apf_reason.enabled))
@@ -492,6 +516,8 @@ void __init kvm_guest_init(void)
kvm_guest_cpu_init();
 #endif
 
+   kvm_guest_idle_init();
+
/*
 * Hard lockup detection is enabled by default. Disable it, as guests
 * can get false positives too easily, for example if the host is
-- 
1.8.3.1



[RFC PATCH v2 6/7] KVM guest: introduce smart idle poll algorithm

2017-08-29 Thread Yang Zhang
Use smart idle polling to reduce useless polling when the system is idle.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/kvm.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7d84a02..ffc8307 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -373,12 +373,53 @@ static void kvm_idle_poll(void)
} while (ktime_before(cur, stop));
 }
 
+static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
+unsigned int max)
+{
+   unsigned int val;
+
+   /* set base poll time to 1ns */
+   if (old == 0 && grow)
+   return 1;
+
+   val = old * grow;
+   if (val > max)
+   val = max;
+
+   return val;
+}
+
+static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
+{
+   if (shrink == 0)
+   return 0;
+
+   return old / shrink;
+}
+
+static int kvm_idle_update_poll_duration(unsigned long idle_duration)
+{
+   unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+
+   if (poll_duration && idle_duration > poll_threshold_ns)
+   poll_duration = shrink_poll_ns(poll_duration, poll_shrink);
+   else if (poll_duration < poll_threshold_ns &&
+idle_duration < poll_threshold_ns)
+   poll_duration = grow_poll_ns(poll_duration, poll_grow,
+poll_threshold_ns);
+
+   this_cpu_write(poll_duration_ns, poll_duration);
+
+   return 0;
+}
+
 static void kvm_guest_idle_init(void)
 {
if (!kvm_para_available())
return;
 
pv_idle_ops.poll = kvm_idle_poll;
+   pv_idle_ops.update = kvm_idle_update_poll_duration;
 }
 
 static void kvm_pv_disable_apf(void)
-- 
1.8.3.1
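
The adaptive behaviour is easiest to see by stepping the grow/shrink
rules through a few idle periods. The following stand-alone simulation
mirrors grow_poll_ns()/shrink_poll_ns() above, assuming grow = shrink = 2
(the defaults in patch 5/7) and a 200000 ns threshold set by the
administrator; it is a sketch of the policy, not the kernel code itself.

/* Long idle periods shrink the poll window, short ones grow it (capped). */
#include <stdio.h>

static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
                                 unsigned int max)
{
        unsigned int val;

        if (old == 0 && grow)
                return 1;              /* base poll time of 1 ns */

        val = old * grow;
        return val > max ? max : val;
}

static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
{
        return shrink ? old / shrink : 0;
}

int main(void)
{
        unsigned int threshold = 200000, poll = 0;
        unsigned int idle[] = { 1000, 1000, 1000, 500000, 500000, 1000 };

        for (int i = 0; i < 6; i++) {
                if (poll && idle[i] > threshold)
                        poll = shrink_poll_ns(poll, 2);
                else if (poll < threshold && idle[i] < threshold)
                        poll = grow_poll_ns(poll, 2, threshold);
                printf("idle %6u ns -> poll window %6u ns\n", idle[i], poll);
        }
        return 0;
}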



[RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll

2017-08-29 Thread Yang Zhang
To reduce the cost of polling, we introduce three sysctls to control
the poll time.

Signed-off-by: Yang Zhang 
Signed-off-by: Quan Xu 
Cc: Jonathan Corbet 
Cc: Jeremy Fitzhardinge 
Cc: Chris Wright 
Cc: Alok Kataria 
Cc: Rusty Russell 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: "Luis R. Rodriguez" 
Cc: Kees Cook 
Cc: Mauro Carvalho Chehab 
Cc: Krzysztof Kozlowski 
Cc: Josh Poimboeuf 
Cc: Andrew Morton 
Cc: Petr Mladek 
Cc: Peter Zijlstra 
Cc: Jessica Yu 
Cc: Larry Finger 
Cc: zijun_hu 
Cc: Baoquan He 
Cc: Johannes Berg 
Cc: Ian Abbott 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-fsde...@vger.kernel.org
---
 Documentation/sysctl/kernel.txt | 25 +
 arch/x86/kernel/paravirt.c  |  4 
 include/linux/kernel.h  |  6 ++
 kernel/sysctl.c | 23 +++
 4 files changed, 58 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..67447b8 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_grow   [ X86 only ]
+- poll_shrink [ X86 only ]
+- poll_threshold_ns   [ X86 only ]
 - powersave-nap   [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
 
 ==
 
+poll_grow: (X86 only)
+
+This parameter is the multiplier used in grow_poll_ns() to increase the poll time.
+By default, the value is 2.
+
+==
+poll_shrink: (X86 only)
+
+This parameter is the divisor used in shrink_poll_ns() to reduce the poll time.
+By default, the value is 2.
+
+==
+poll_threshold_ns: (X86 only)
+
+This parameter controls the maximum poll time before entering the real idle path.
+It is expected to take effect only when running inside a VM; it makes no
+sense to turn it on on bare metal.
+By default, the value is 0, which means do not poll. It is recommended to set
+a non-zero value when running latency-bound workloads inside a VM.
+
+==
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a11b2c2..0b92f8f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
.update = paravirt_nop,
 };
 
+unsigned long poll_threshold_ns;
+unsigned int poll_shrink = 2;
+unsigned int poll_grow = 2;
+
 __visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index bd6d96c..6cb2820 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -462,6 +462,12 @@ extern __scanf(2, 0)
 
 extern bool crash_kexec_post_notifiers;
 
+#ifdef CONFIG_PARAVIRT
+extern unsigned long poll_threshold_ns;
+extern unsigned int poll_shrink;
+extern unsigned int poll_grow;
+#endif
+
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
  * holds a CPU number which is executing panic() currently. A value of
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..9b86a8f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.extra2 = ,
},
 #endif
+#ifdef CONFIG_PARAVIRT
+   {
+   .procname   = "halt_poll_threshold",
+   .data   = _threshold_ns,
+   .maxlen = sizeof(unsigned long),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "halt_poll_grow",
+   .data   = _grow,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "halt_poll_shrink",
+   .data   = _shrink,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
{ }
 };
 
-- 
1.8.3.1
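
Once applied, the three knobs are ordinary sysctls and can be tuned at
run time. A hedged usage sketch, assuming the entries show up under
/proc/sys/kernel with the names registered in kernel/sysctl.c above:

/* Set the halt-poll knobs from user space by writing the procfs files. */
#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        write_sysctl("/proc/sys/kernel/halt_poll_threshold", "200000"); /* ns */
        write_sysctl("/proc/sys/kernel/halt_poll_grow", "2");
        write_sysctl("/proc/sys/kernel/halt_poll_shrink", "2");
        return 0;
}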



[RFC PATCH v2 0/7] x86/idle: add halt poll support

2017-08-29 Thread Yang Zhang
Some latency-intensive workloads see an obvious performance drop when
running inside a VM, mainly because the overhead is amplified there.
The largest cost I have seen is in the idle path.

This series introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, we
do not need to go through the heavy overhead path.

Here is the data we got when running the contextswitch benchmark to
measure latency (lower is better):

   1. w/o patch:
  2493.14 ns/ctxsw -- 200.3 %CPU
   
   2. w/ patch:
  halt_poll_threshold=1 -- 1485.96ns/ctxsw -- 201.0 %CPU
  halt_poll_threshold=2 -- 1391.26 ns/ctxsw -- 200.7 %CPU
  halt_poll_threshold=3 -- 1488.55 ns/ctxsw -- 200.1 %CPU
  halt_poll_threshold=50 -- 1159.14 ns/ctxsw -- 201.5 %CPU
   
   3. kvm dynamic poll
  halt_poll_ns=1 -- 2296.11 ns/ctxsw -- 201.2 %CPU
  halt_poll_ns=2 -- 2599.7 ns/ctxsw -- 201.7 %CPU
  halt_poll_ns=3 -- 2588.68 ns/ctxsw -- 211.6 %CPU
  halt_poll_ns=50 -- 2423.20 ns/ctxsw -- 229.2 %CPU
   
   4. idle=poll
  2050.1 ns/ctxsw -- 1003 %CPU
   
   5. idle=mwait
  2188.06 ns/ctxsw -- 206.3 %CPU

Here is the data we got when running the netperf benchmark:

   1. w/o patch:
  14556.8 bits/s  -- 144.2 %CPU

   2. w/ patch:
  halt_poll_threshold=1 -- 15803.89 bits/s -- 159.5 %CPU
  halt_poll_threshold=2 -- 15899.04 bits/s -- 161.5 %CPU
  halt_poll_threshold=3 -- 15642.38 bits/s -- 161.8 %CPU
  halt_poll_threshold=4 -- 18040.76 bits/s -- 184.0 %CPU
  halt_poll_threshold=5 -- 18877.61 bits/s -- 197.3 %CPU

   3. kvm dynamic poll
  halt_poll_ns=1 -- 15876.00 bits/s -- 172.2 %CPU
  halt_poll_ns=2 -- 15602.58 bits/s -- 185.4 %CPU
  halt_poll_ns=3 -- 15930.69 bits/s -- 194.4 %CPU
  halt_poll_ns=4 -- 16413.09 bits/s -- 195.3 %CPU
  halt_poll_ns=5 -- 16417.42 bits/s -- 196.3 %CPU

   4. idle=poll in guest
  18441.3bit/s -- 1003 %CPU

   5. idle=mwait in guest
  15760.6  bits/s  -- 157.6 %CPU

V1 -> V2:
- integrate the smart halt poll into the paravirt code
- use idle_stamp instead of check_poll
- since it is hard to tell whether the vCPU is the only task on its
  pCPU, we do not consider that case in this series (may improve it in
  the future)

Yang Zhang (7):
  x86/paravirt: Add pv_idle_ops to paravirt ops
  KVM guest: register kvm_idle_poll for pv_idle_ops
  sched/idle: Add poll before enter real idle path
  x86/paravirt: Add update in x86/paravirt pv_idle_ops
  Documentation: Add three sysctls for smart idle poll
  KVM guest: introduce smart idle poll algorithm
  sched/idle: update poll time when wakeup from idle

 Documentation/sysctl/kernel.txt   | 25 +
 arch/x86/include/asm/paravirt.h   | 10 ++
 arch/x86/include/asm/paravirt_types.h |  7 
 arch/x86/kernel/kvm.c | 67 +++
 arch/x86/kernel/paravirt.c| 11 ++
 arch/x86/kernel/process.c |  7 
 include/linux/kernel.h|  6 
 include/linux/sched/idle.h|  4 +++
 kernel/sched/core.c   |  4 +++
 kernel/sched/idle.c   |  9 +
 kernel/sysctl.c   | 23 
 11 files changed, 173 insertions(+)

-- 
1.8.3.1



[RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path

2017-08-29 Thread Yang Zhang
Add a poll to do_idle(). In a UP VM, if there is a runnable task, the
CPU will not go into the idle path, so we only enable polling in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Kyle Huey <m...@kylehuey.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Len Brown <len.br...@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c | 7 +++
 kernel/sched/idle.c   | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3ca1980..def4113 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,6 +332,13 @@ void arch_cpu_idle(void)
x86_idle();
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void arch_cpu_idle_poll(void)
+{
+   paravirt_idle_poll();
+}
+#endif
+
 /*
  * We use this if we don't have any better idle routine..
  */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..b374744 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,7 @@ static void do_idle(void)
 */
 
__current_set_polling();
+   arch_cpu_idle_poll();
quiet_vmstat();
tick_nohz_idle_enter();
 
-- 
1.8.3.1
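
The new hook uses the kernel's usual weak-symbol pattern: the scheduler
provides an empty default and x86 overrides it when CONFIG_SMP and
CONFIG_PARAVIRT are set. A stand-alone illustration of that linkage
trick, using the GCC/Clang attribute directly rather than the kernel's
__weak macro:

/* The weak definition is the fallback; a strong definition elsewhere
 * replaces it at link time. Built as a single file, the fallback wins. */
#include <stdio.h>

void __attribute__((weak)) arch_cpu_idle_poll(void)
{
        /* default: do nothing, like the weak stub added to kernel/sched/idle.c */
}

/* A second translation unit could provide the x86-style override:
 *
 *      void arch_cpu_idle_poll(void)
 *      {
 *              puts("polling before the real idle path");
 *      }
 */

int main(void)
{
        arch_cpu_idle_poll();   /* calls whichever definition the linker picked */
        puts("entering real idle path");
        return 0;
}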



[RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle

2017-08-29 Thread Yang Zhang
ttwu_do_wakeup() updates avg_idle on wakeup from idle. Here we just
reuse this logic to update the poll time. It may be a little late to
update the poll duration in ttwu_do_wakeup(), but test results show no
obvious performance gap compared with updating it in the irq handler.

One problem is that idle_stamp is only used by the CFS scheduler. But
that is acceptable, since CFS is the default scheduling policy and
considering only it should be enough.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
---
 include/linux/sched/idle.h | 4 
 kernel/sched/core.c| 4 
 kernel/sched/idle.c| 7 +++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 5ca63eb..6e0554d 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -12,6 +12,10 @@ enum cpu_idle_type {
 
 extern void wake_up_if_idle(int cpu);
 
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+extern void update_poll_duration(unsigned long idle_duration);
+#endif
+
 /*
  * Idle thread specific functions to determine the need_resched
  * polling state.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20..25be9a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1678,6 +1678,10 @@ static void ttwu_do_wakeup(struct rq *rq, struct 
task_struct *p, int wake_flags,
if (rq->avg_idle > max)
rq->avg_idle = max;
 
+#if defined(CONFIG_PARAVIRT)
+   update_poll_duration(rq->avg_idle);
+#endif
+
rq->idle_stamp = 0;
}
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b374744..7eb8559 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -101,6 +101,13 @@ void __cpuidle default_idle_call(void)
}
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void update_poll_duration(unsigned long idle_duration)
+{
+   paravirt_idle_update_poll_duration(idle_duration);
+}
+#endif
+
 static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
  int next_state)
 {
-- 
1.8.3.1
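
The feedback loop, in isolation, looks like the following stand-alone
model: at wakeup, the idle duration is smoothed and then handed to the
poll-duration update. The /8 smoothing only mimics the flavour of the
scheduler's avg_idle averaging and is an assumption here, not a claim
about the exact kernel formula.

#include <stdio.h>

static unsigned long avg_idle;

static void update_avg(unsigned long *avg, unsigned long sample)
{
        long diff = (long)(sample - *avg);

        *avg += diff / 8;
}

/* Stands in for paravirt_idle_update_poll_duration(rq->avg_idle). */
static void update_poll_duration(unsigned long idle_duration)
{
        printf("feed idle_duration=%lu ns into the poll-window update\n",
               idle_duration);
}

int main(void)
{
        unsigned long samples[] = { 1000, 2000, 300000, 1500 };

        for (int i = 0; i < 4; i++) {
                update_avg(&avg_idle, samples[i]);
                update_poll_duration(avg_idle);   /* done at wakeup, as in ttwu */
        }
        return 0;
}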



[RFC PATCH v2 4/7] x86/paravirt: Add update in x86/paravirt pv_idle_ops

2017-08-29 Thread Yang Zhang
.update is used to adjust the next poll time.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Jeremy Fitzhardinge <jer...@goop.org>
Cc: Chris Wright <chr...@sous-sol.org>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/paravirt.h   | 5 +
 arch/x86/include/asm/paravirt_types.h | 1 +
 arch/x86/kernel/paravirt.c| 1 +
 3 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6d46760..32e1c06 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -207,6 +207,11 @@ static inline void paravirt_idle_poll(void)
PVOP_VCALL0(pv_idle_ops.poll);
 }
 
+static inline void paravirt_idle_update_poll_duration(unsigned long duration)
+{
+   PVOP_VCALL1(pv_idle_ops.update, duration);
+}
+
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned 
entries)
 {
PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index cf45726..3b4f95a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -326,6 +326,7 @@ struct pv_lock_ops {
 
 struct pv_idle_ops {
void (*poll)(void);
+   void (*update)(unsigned long);
 } __no_randomize_layout;
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1b5b247..a11b2c2 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -315,6 +315,7 @@ struct pv_time_ops pv_time_ops = {
 
 struct pv_idle_ops pv_idle_ops = {
.poll = paravirt_nop,
+   .update = paravirt_nop,
 };
 
 __visible struct pv_irq_ops pv_irq_ops = {
-- 
1.8.3.1
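
The point of routing both callbacks through pv_idle_ops is that another
guest type can plug in its own policy without touching the scheduler.
A sketch of that indirection in plain C; the xen_* names below are
hypothetical stand-ins for a future user, not existing in-tree code.

/* No-op defaults that a guest overrides at init time, mirroring how
 * kvm_guest_idle_init() fills in .poll and .update in this series. */
#include <stdio.h>

struct pv_idle_ops {
        void (*poll)(void);
        void (*update)(unsigned long duration);
};

static void nop_poll(void) { }
static void nop_update(unsigned long duration) { (void)duration; }

static struct pv_idle_ops pv_idle_ops = { nop_poll, nop_update };

/* Hypothetical guest-specific implementations. */
static void xen_idle_poll(void) { puts("xen: poll before idle"); }
static void xen_update_poll(unsigned long d) { printf("xen: window %lu ns\n", d); }

static void xen_guest_idle_init(void)
{
        pv_idle_ops.poll = xen_idle_poll;
        pv_idle_ops.update = xen_update_poll;
}

int main(void)
{
        pv_idle_ops.poll();           /* no-op before registration */
        xen_guest_idle_init();
        pv_idle_ops.poll();           /* now routed to the guest hook */
        pv_idle_ops.update(200000);
        return 0;
}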



Re: [PATCH 0/3] KVM, pkeys: fix handling of PKRU across migration

2017-08-27 Thread Yang Zhang

On 2017/8/24 18:12, Paolo Bonzini wrote:

The host pkru is restored right after vcpu exit (commit 1be0e61), so
KVM_GET_XSAVE will return the host PKRU value instead.  In general,
the PKRU value in vcpu->arch.guest_fpu.state cannot be trusted.

Series as follows:

1) fix independent bug which would cause an oops

2) remove an unnecessary abstraction

3) fix the bug

Please test the patches, as I don't have the affected hardware.  Note
that I need the results before tomorrow in order to send these patches
to Linus before going on vacation.

Thanks,

Paolo


Paolo Bonzini (3):
   KVM: x86: block guest protection keys unless the host has them enabled
   KVM: x86: simplify handling of PKRU
   KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state

  arch/x86/include/asm/fpu/internal.h |  6 +++---
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/cpuid.c|  2 +-
  arch/x86/kvm/kvm_cache_regs.h   |  5 -
  arch/x86/kvm/mmu.h  |  2 +-
  arch/x86/kvm/svm.c  |  7 ---
  arch/x86/kvm/vmx.c  | 25 -
  arch/x86/kvm/x86.c  | 17 ++---
  8 files changed, 28 insertions(+), 37 deletions(-)



Reviewed-by: Yang Zhang <yang.zhang...@gmail.com>

--
Yang
Alibaba Cloud Computing


Re: [PATCH 0/3] KVM, pkeys: fix handling of PKRU across migration

2017-08-24 Thread Yang Zhang

On 2017/8/24 18:12, Paolo Bonzini wrote:

The host pkru is restored right after vcpu exit (commit 1be0e61), so
KVM_GET_XSAVE will return the host PKRU value instead.  In general,
the PKRU value in vcpu->arch.guest_fpu.state cannot be trusted.

Series as follows:

1) fix independent bug which would cause an oops

2) remove an unnecessary abstraction

3) fix the bug

Please test the patches, as I don't have the affected hardware.  Note
that I need the results before tomorrow in order to send these patches
to Linus before going on vacation.


hi Quan

Can you help to test Paolo's patch?



Thanks,

Paolo


Paolo Bonzini (3):
   KVM: x86: block guest protection keys unless the host has them enabled
   KVM: x86: simplify handling of PKRU
   KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state

  arch/x86/include/asm/fpu/internal.h |  6 +++---
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/cpuid.c|  2 +-
  arch/x86/kvm/kvm_cache_regs.h   |  5 -
  arch/x86/kvm/mmu.h  |  2 +-
  arch/x86/kvm/svm.c  |  7 ---
  arch/x86/kvm/vmx.c  | 25 -
  arch/x86/kvm/x86.c  | 17 ++---
  8 files changed, 28 insertions(+), 37 deletions(-)




--
Yang
Alibaba Cloud Computing


Re: [PATCH 1/2] KVM: x86: simplify handling of PKRU

2017-08-24 Thread Yang Zhang

On 2017/8/24 17:19, Paolo Bonzini wrote:

On 24/08/2017 11:09, Yang Zhang wrote:

+if (static_cpu_has(X86_FEATURE_OSPKE) &&


We expose protection keys to the VM without checking whether OSPKE is
enabled or not. Why not check the guest's CPUID here, which would also
avoid unnecessary accesses to pkru?


Checking guest CPUID is pretty slow.  We could check CR4.PKE though.

Also, using static_cpu_has with OSPKE is probably wrong.  But if we do
check CR4.PKE, we can check X86_FEATURE_PKU instead, so something like

if (static_cpu_has(X86_FEATURE_PKU) &&
kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
vcpu->arch.pkru != vmx->host_pkru)

... but then, kvm_read_cr4_bits is also pretty slow---and we don't
really need it, since all CR4 writes cause a vmexit.  So for now I'd
stay with this patch, only s/static_cpu_has/boot_cpu_has/g.

Of course you can send improvements on top!


OK. Most OS distributions don't support protection keys so far, which
means vcpu->arch.pkru is always 0 in them, and I remember host_pkru is
set to 5554 by default. To avoid the unnecessary accesses to pkru,
how about the following change:


@@ -4338,6 +4331,9 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, 
unsigned long cr4)

return 1;
}

+   if (cr4 & X86_CR4_PKE)
+   to_vmx(vcpu)->guest_pkru_valid = true;
+
if (to_vmx(vcpu)->nested.vmxon && !nested_cr4_valid(vcpu, cr4))
return 1;

@@ -9020,8 +9016,10 @@ static void __noclone vmx_vcpu_run(struct 
kvm_vcpu *vcpu)

if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
vmx_set_interrupt_shadow(vcpu, 0);

-   if (vmx->guest_pkru_valid)
-   __write_pkru(vmx->guest_pkru);
+   if (static_cpu_has(X86_FEATURE_OSPKE) &&
+   vmx->guest_pkru_valid &&
+   vcpu->arch.pkru != vmx->host_pkru)
+   __write_pkru(vcpu->arch.pkru);

atomic_switch_perf_msrs(vmx);
debugctlmsr = get_debugctlmsr();
@@ -9169,13 +9167,11 @@ static void __noclone vmx_vcpu_run(struct 
kvm_vcpu *vcpu)

 * back on host, so it is safe to read guest PKRU from current
 * XSAVE.
 */
-   if (boot_cpu_has(X86_FEATURE_OSPKE)) {
-   vmx->guest_pkru = __read_pkru();
-   if (vmx->guest_pkru != vmx->host_pkru) {
-   vmx->guest_pkru_valid = true;
+   if (static_cpu_has(X86_FEATURE_OSPKE) &&
+   vmx->guest_pkru_valid) {
+   vcpu->arch.pkru = __read_pkru();
+   if (vcpu->arch.pkru != vmx->host_pkru)
__write_pkru(vmx->host_pkru);
-   } else
-   vmx->guest_pkru_valid = false;
}

/*

--
Yang
Alibaba Cloud Computing
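
The trade-off being discussed above boils down to: only touch PKRU
around guest entry when protection keys are usable by the guest and the
values actually differ. A small model of that check in plain C; the
names and example values are illustrative stand-ins, not the real
kvm/vmx.c symbols.

#include <stdbool.h>
#include <stdio.h>

struct vcpu_model {
        bool cpu_has_pku;        /* host CPU supports protection keys */
        bool guest_cr4_pke;      /* guest has CR4.PKE set             */
        unsigned int guest_pkru;
        unsigned int host_pkru;
};

static void write_pkru(unsigned int v) { printf("wrpkru(%#x)\n", v); }

static void enter_guest(struct vcpu_model *v)
{
        if (v->cpu_has_pku && v->guest_cr4_pke && v->guest_pkru != v->host_pkru)
                write_pkru(v->guest_pkru);   /* only pay for wrpkru when needed */
        puts("vmenter");
}

int main(void)
{
        struct vcpu_model v = { true, false, 0x0, 0x4 /* some host value */ };

        enter_guest(&v);          /* guest PKE off: no wrpkru issued */
        v.guest_cr4_pke = true;
        enter_guest(&v);          /* values differ: the write happens */
        return 0;
}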


Re: [PATCH 1/2] KVM: x86: simplify handling of PKRU

2017-08-24 Thread Yang Zhang

On 2017/8/24 5:26, Paolo Bonzini wrote:

Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a
simple comparison against the host value of the register.


Thanks for refining the patches. :)



Signed-off-by: Paolo Bonzini 
---
  arch/x86/include/asm/kvm_host.h |  1 +
  arch/x86/kvm/kvm_cache_regs.h   |  5 -
  arch/x86/kvm/mmu.h  |  2 +-
  arch/x86/kvm/svm.c  |  7 ---
  arch/x86/kvm/vmx.c  | 23 ++-
  5 files changed, 8 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 643308143bea..beb185361045 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -491,6 +491,7 @@ struct kvm_vcpu_arch {
unsigned long cr4;
unsigned long cr4_guest_owned_bits;
unsigned long cr8;
+   u32 pkru;
u32 hflags;
u64 efer;
u64 apic_base;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 762cdf2595f9..e1e89ee4af75 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -84,11 +84,6 @@ static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu)
| ((u64)(kvm_register_read(vcpu, VCPU_REGS_RDX) & -1u) << 32);
  }
  
-static inline u32 kvm_read_pkru(struct kvm_vcpu *vcpu)

-{
-   return kvm_x86_ops->get_pkru(vcpu);
-}
-
  static inline void enter_guest_mode(struct kvm_vcpu *vcpu)
  {
vcpu->arch.hflags |= HF_GUEST_MASK;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 3ed6192d93b1..1b3095264fcf 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -168,7 +168,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
* index of the protection domain, so pte_pkey * 2 is
* is the index of the first bit for the domain.
*/
-   pkru_bits = (kvm_read_pkru(vcpu) >> (pte_pkey * 2)) & 3;
+   pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
  
  		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */

offset = (pfec & ~1) +
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 32c8d8f62985..f2355135164c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1826,11 +1826,6 @@ static void svm_set_rflags(struct kvm_vcpu *vcpu, 
unsigned long rflags)
to_svm(vcpu)->vmcb->save.rflags = rflags;
  }
  
-static u32 svm_get_pkru(struct kvm_vcpu *vcpu)

-{
-   return 0;
-}
-
  static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
  {
switch (reg) {
@@ -5438,8 +5433,6 @@ static void svm_setup_mce(struct kvm_vcpu *vcpu)
.get_rflags = svm_get_rflags,
.set_rflags = svm_set_rflags,
  
-	.get_pkru = svm_get_pkru,

-
.tlb_flush = svm_flush_tlb,
  
  	.run = svm_vcpu_run,

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ddabed8425b3..c603f658c42a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -639,8 +639,6 @@ struct vcpu_vmx {
  
  	u64 current_tsc_ratio;
  
-	bool guest_pkru_valid;

-   u32 guest_pkru;
u32 host_pkru;
  
  	/*

@@ -2392,11 +2390,6 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, 
unsigned long rflags)
to_vmx(vcpu)->emulation_required = emulation_required(vcpu);
  }
  
-static u32 vmx_get_pkru(struct kvm_vcpu *vcpu)

-{
-   return to_vmx(vcpu)->guest_pkru;
-}
-
  static u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
  {
u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
@@ -9239,8 +9232,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
vmx_set_interrupt_shadow(vcpu, 0);
  
-	if (vmx->guest_pkru_valid)

-   __write_pkru(vmx->guest_pkru);
+   if (static_cpu_has(X86_FEATURE_OSPKE) &&


We expose protection keys to the VM without checking whether OSPKE is
enabled or not. Why not check the guest's CPUID here, which would also
avoid unnecessary accesses to pkru?



+   vcpu->arch.pkru != vmx->host_pkru)
+   __write_pkru(vcpu->arch.pkru);
  
  	atomic_switch_perf_msrs(vmx);

debugctlmsr = get_debugctlmsr();
@@ -9388,13 +9382,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
*vcpu)
 * back on host, so it is safe to read guest PKRU from current
 * XSAVE.
 */
-   if (boot_cpu_has(X86_FEATURE_OSPKE)) {
-   vmx->guest_pkru = __read_pkru();
-   if (vmx->guest_pkru != vmx->host_pkru) {
-   vmx->guest_pkru_valid = true;
+   if (static_cpu_has(X86_FEATURE_OSPKE)) {
+   vcpu->arch.pkru = __read_pkru();
+   if (vcpu->arch.pkru != vmx->host_pkru)
__write_pkru(vmx->host_pkru);
-   } else
-   vmx->guest_pkru_valid = false;
}
  
  	/*

@@ -11875,8 +11866,6 @@ static void 

Re: [PATCH] kvm: VMX: do not use vm-exit instruction length for fast MMIO

2017-08-17 Thread Yang Zhang

On 2017/8/17 16:51, Wanpeng Li wrote:

2017-08-17 16:48 GMT+08:00 Yang Zhang <yang.zhang...@gmail.com>:

On 2017/8/17 16:31, Wanpeng Li wrote:


2017-08-17 16:28 GMT+08:00 Wanpeng Li <kernel...@gmail.com>:


2017-08-17 16:07 GMT+08:00 Yang Zhang <yang.zhang...@gmail.com>:


On 2017/8/17 0:56, Radim Krčmář wrote:



2017-08-16 17:10+0300, Michael S. Tsirkin:



On Wed, Aug 16, 2017 at 03:34:54PM +0200, Paolo Bonzini wrote:



Microsoft pointed out privately to me that KVM's handling of
KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is
invalid
in EPT misconfiguration vmexit handlers, because neither EPT
violations
nor misconfigurations are listed in the manual among the VM exits
that
set the VM-exit instruction length field.

While physical processors seem to set the field, this is not
architectural
and is just a side effect of the implementation.  I couldn't convince
myself of any condition on the exit qualification where VM-exit
instruction length "has" to be defined; there are no trap-like
VM-exits
that can be repurposed; and fault-like VM-exits such as
descriptor-table
exits provide no decoding information.  So I don't really see any way
to keep the full speedup.

What we can do is use EMULTYPE_SKIP; it only saves 200 clock cycles
because computing the physical RIP and reading the instruction is
expensive, but at least the eventfd is signaled before entering the
emulator.  This saves on latency.  While at it, don't check
breakpoints
when skipping the instruction, as presumably any side effect has been
exposed already.

Adding a hypercall or MSR write that does a fast MMIO write to a
physical
address would do it, but it adds hypervisor knowledge in virtio,
including
CPUID handling.  So it would be pretty ugly in the guest-side
implementation,
but if somebody wants to do it and the virtio side is acceptable to
the
virtio maintainers, I am okay with it.

Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: sta...@vger.kernel.org
Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae
Suggested-by: Radim Krčmář <rkrc...@redhat.com>
Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>




Jason (cc) who worked on the original optimization said he can
work to test the performance impact.
I suggest we don't rush this (it's been like this for 2 years),
and the issue seems to be largely theoretical.




Paolo, did Microsoft point it out because they hit the bug when running
KVM on Hyper-V?




Does this mean the nested emulation of EPT violation and
misconfiguration in
KVM side doesn't strictly follow the manual since we didn't hit the bug
in
KVM?



The VM-exit instruction length of vmcs12 is provided by vmcs02
(prepare_vmcs12()), so unless the length from vmcs02 is wrong. In
addition, something like mov instruction which can trigger the EPT
violation/misconfig in guest has already been decoded before executing
I think, IIUC, then exit qualification can have the information about
the instruction length.



s/exit qualification/VM-exit instruction length



According to Paolo's comment "neither EPT violations nor misconfigurations
are listed in the manual among the VM exits that set the VM-exit instruction
length field", it seems that setting the instruction length in vmcs12 is not
right, though it is harmless.


But Paolo also mentioned this "It just happens that the actual
condition for VM-exit instruction length being set correctly is "the
fault was taken after the accessing instruction has been decoded"."


We are talking about different things. As the manual says, "All VM exits
other than those listed in the above items leave this field undefined."
If we set the field for a VM exit that is not in that list, our
emulation is not correct. But I have checked the code: KVM just reads the
instruction length from hardware, which means we didn't set it artificially.




Regards,
Wanpeng Li




--
Yang
Alibaba Cloud Computing


Re: [PATCH] kvm: VMX: do not use vm-exit instruction length for fast MMIO

2017-08-17 Thread Yang Zhang

On 2017/8/17 16:31, Wanpeng Li wrote:

2017-08-17 16:28 GMT+08:00 Wanpeng Li <kernel...@gmail.com>:

2017-08-17 16:07 GMT+08:00 Yang Zhang <yang.zhang...@gmail.com>:

On 2017/8/17 0:56, Radim Krčmář wrote:


2017-08-16 17:10+0300, Michael S. Tsirkin:


On Wed, Aug 16, 2017 at 03:34:54PM +0200, Paolo Bonzini wrote:


Microsoft pointed out privately to me that KVM's handling of
KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is
invalid
in EPT misconfiguration vmexit handlers, because neither EPT violations
nor misconfigurations are listed in the manual among the VM exits that
set the VM-exit instruction length field.

While physical processors seem to set the field, this is not
architectural
and is just a side effect of the implementation.  I couldn't convince
myself of any condition on the exit qualification where VM-exit
instruction length "has" to be defined; there are no trap-like VM-exits
that can be repurposed; and fault-like VM-exits such as descriptor-table
exits provide no decoding information.  So I don't really see any way
to keep the full speedup.

What we can do is use EMULTYPE_SKIP; it only saves 200 clock cycles
because computing the physical RIP and reading the instruction is
expensive, but at least the eventfd is signaled before entering the
emulator.  This saves on latency.  While at it, don't check breakpoints
when skipping the instruction, as presumably any side effect has been
exposed already.

Adding a hypercall or MSR write that does a fast MMIO write to a
physical
address would do it, but it adds hypervisor knowledge in virtio,
including
CPUID handling.  So it would be pretty ugly in the guest-side
implementation,
but if somebody wants to do it and the virtio side is acceptable to the
virtio maintainers, I am okay with it.

Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: sta...@vger.kernel.org
Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae
Suggested-by: Radim Krčmář <rkrc...@redhat.com>
Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>



Jason (cc) who worked on the original optimization said he can
work to test the performance impact.
I suggest we don't rush this (it's been like this for 2 years),
and the issue seems to be largely theoretical.



Paolo, did Microsoft point it out because they hit the bug when running
KVM on Hyper-V?



Does this mean the nested emulation of EPT violation and misconfiguration in
KVM side doesn't strictly follow the manual since we didn't hit the bug in
KVM?


The VM-exit instruction length of vmcs12 is provided by vmcs02
(prepare_vmcs12()), so unless the length from vmcs02 is wrong. In
addition, something like mov instruction which can trigger the EPT
violation/misconfig in guest has already been decoded before executing
I think, IIUC, then exit qualification can have the information about
the instruction length.


s/exit qualification/VM-exit instruction length


According to Paolo's comment "neither EPT violations nor misconfigurations
are listed in the manual among the VM exits that set the VM-exit instruction
length field", it seems that setting the instruction length in vmcs12 is not
right, though it is harmless.


--
Yang
Alibaba Cloud Computing


Re: [PATCH] kvm: VMX: do not use vm-exit instruction length for fast MMIO

2017-08-17 Thread Yang Zhang

On 2017/8/17 0:56, Radim Krčmář wrote:

2017-08-16 17:10+0300, Michael S. Tsirkin:

On Wed, Aug 16, 2017 at 03:34:54PM +0200, Paolo Bonzini wrote:

Microsoft pointed out privately to me that KVM's handling of
KVM_FAST_MMIO_BUS is invalid.  Using skip_emulated_instruction is invalid
in EPT misconfiguration vmexit handlers, because neither EPT violations
nor misconfigurations are listed in the manual among the VM exits that
set the VM-exit instruction length field.

While physical processors seem to set the field, this is not architectural
and is just a side effect of the implementation.  I couldn't convince
myself of any condition on the exit qualification where VM-exit
instruction length "has" to be defined; there are no trap-like VM-exits
that can be repurposed; and fault-like VM-exits such as descriptor-table
exits provide no decoding information.  So I don't really see any way
to keep the full speedup.

What we can do is use EMULTYPE_SKIP; it only saves 200 clock cycles
because computing the physical RIP and reading the instruction is
expensive, but at least the eventfd is signaled before entering the
emulator.  This saves on latency.  While at it, don't check breakpoints
when skipping the instruction, as presumably any side effect has been
exposed already.

Adding a hypercall or MSR write that does a fast MMIO write to a physical
address would do it, but it adds hypervisor knowledge in virtio, including
CPUID handling.  So it would be pretty ugly in the guest-side implementation,
but if somebody wants to do it and the virtio side is acceptable to the
virtio maintainers, I am okay with it.

Cc: Michael S. Tsirkin 
Cc: sta...@vger.kernel.org
Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae
Suggested-by: Radim Krčmář 
Signed-off-by: Paolo Bonzini 


Jason (cc) who worked on the original optimization said he can
work to test the performance impact.
I suggest we don't rush this (it's been like this for 2 years),
and the issue seems to be largely theoretical.


Paolo, did Microsoft point it out because they hit the bug when running
KVM on Hyper-V?


Does this mean the nested emulation of EPT violations and
misconfigurations on the KVM side doesn't strictly follow the manual,
since we didn't hit the bug in KVM?




Thanks.




--
Yang
Alibaba Cloud Computing


Re: [PATCH 1/2] x86/idle: add halt poll for halt idle

2017-08-17 Thread Yang Zhang

On 2017/8/16 12:04, Michael S. Tsirkin wrote:

On Thu, Jun 22, 2017 at 11:22:13AM +, root wrote:

From: Yang Zhang 

This patch introduces a new mechanism to poll for a while before
entering the idle state.

David had a topic at KVM Forum describing the problem with current KVM VMs
when running message passing workloads. There is also work to improve
performance in KVM, like halt polling, but we still have 4 MSR writes and
an HLT vmexit when going into halt idle, which introduces a lot of latency.

Halt polling in KVM provides the capability to not schedule out a VCPU when
it is the only task on its pCPU. Unlike that, this patch lets the VCPU poll
for a while if there is no work inside the VCPU, to eliminate the heavy
vmexits around idle entry/exit. The potential impact is that it costs more
CPU cycles, since we are doing polling, and it may impact other tasks
waiting on the same physical CPU in the host.


I wonder whether you considered doing this in an idle driver.
I have a prototype patch combining this with mwait within guest -
I can post it if you are interested.


I am not sure MWAIT can solve this problem. But yes, if you have any
prototype patch, I can do a test. Also, I am working on the next version
with a better approach and will post it ASAP.






Here is the data I get when running the contextswitch benchmark
(https://github.com/tsuna/contextswitch)

before patch:
2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)

after patch:
2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)

Signed-off-by: Yang Zhang 
---
 Documentation/sysctl/kernel.txt | 10 ++
 arch/x86/kernel/process.c   | 21 +
 include/linux/kernel.h  |  3 +++
 kernel/sched/idle.c |  3 +++
 kernel/sysctl.c |  9 +
 5 files changed, 46 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..4e71bfe 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_threshold_ns[ X86 only ]
 - powersave-nap   [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.

 ==

+poll_threshold_ns: (X86 only)
+
+This parameter is used to control the maximum time to poll before going
+into the real idle state. By default, the value is 0, which means do not
+poll. It is recommended to change the value to non-zero when running
+latency-bound workloads in a VM.
+
+==
+
 powersave-nap: (PPC only)

 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0bb8842..6361783 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,10 @@
 #include 
 #include 

+#ifdef CONFIG_HYPERVISOR_GUEST
+unsigned long poll_threshold_ns;
+#endif
+
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
  * no more per-task TSS's. The TSS size is kept cacheline-aligned
@@ -313,6 +317,23 @@ static inline void play_dead(void)
 }
 #endif

+#ifdef CONFIG_HYPERVISOR_GUEST
+void arch_cpu_idle_poll(void)
+{
+   ktime_t start, cur, stop;
+
+   if (poll_threshold_ns) {
+   start = cur = ktime_get();
+   stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
+   do {
+   if (need_resched())
+   break;
+   cur = ktime_get();
+   } while (ktime_before(cur, stop));
+   }
+}
+#endif
+
 void arch_cpu_idle_enter(void)
 {
tsc_verify_tsc_adjust(false);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 13bc08a..04cf774 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -460,6 +460,9 @@ extern __scanf(2, 0)
 extern int sysctl_panic_on_stackoverflow;

 extern bool crash_kexec_post_notifiers;
+#ifdef CONFIG_HYPERVISOR_GUEST
+extern unsigned long poll_threshold_ns;
+#endif

 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2a25a9e..e789f99 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }

 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,8 @@ static void do_idle(void)
 */

__current_set_polling();
+   arch_cpu_idle_poll();
+
tick_nohz_idle_enter();

while (!need_resched()) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a

Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-07-17 Thread Yang Zhang

On 2017/7/17 17:54, Alexander Graf wrote:



On 17.07.17 11:26, Yang Zhang wrote:

On 2017/7/14 17:37, Alexander Graf wrote:



On 13.07.17 13:49, Yang Zhang wrote:

On 2017/7/4 22:13, Radim Krčmář wrote:

2017-07-03 17:28+0800, Yang Zhang:

The background is that we (Alibaba Cloud) get more and more complaints
from our customers on both KVM and Xen compared to bare metal. After
investigation, the root cause is known to us: the big cost of message
passing workloads (David showed it at KVM Forum 2015).

A typical message passing workload looks like below:
vcpu 0                             vcpu 1
1. send IPI                        2.  doing hlt
3. go into idle                    4.  receive IPI and wake up from hlt
5. write APIC timer twice          6.  write APIC timer twice to
   to stop the sched timer             reprogram the sched timer


One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?


One write is to remove the timer and the other is to reprogram it.
Normally there is only one write, to remove the timer, but in some
cases it will be reprogrammed.




7. doing hlt                       8.  handle task and send IPI to vcpu 0
9. same as 4                       10. same as 3

One transaction will introduce about 12 vmexits (2 hlt and 10 MSR writes).
The cost of such vmexits degrades performance severely.


Yeah, sounds like too much ... I understood that there are

  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1

which adds to 6 MSR writes -- what are the other 4?


In the worst case, each timer will touch the APIC timer twice, so it will
add 4 additional MSR writes. But this is not always true.




The Linux kernel already provides idle=poll to mitigate the trend, but it
only eliminates the IPI and hlt vmexits; it has nothing to do with
starting/stopping the sched timer. A compromise would be to turn off the
NOHZ kernel, but that is not the default config for new distributions.
The same goes for halt-poll in KVM: it only removes the cost of scheduling
in/out in the host and cannot help such workloads much.

The purpose of this patch is to improve the current idle=poll
mechanism to


Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
down the sibling hyperthread.  MWAIT solves the IPI problem, but
doesn't
get rid of the timer one.


Yes, I can try it. But MWAIT will not yield the CPU; it only helps the
sibling hyperthread, as you mentioned.


If you implement proper MWAIT emulation that conditionally gets en- or
disabled depending on the same halt poll dynamics that we already have
for in-host HLT handling, it will also yield the CPU.


It is hard to do. If we do not intercept the MWAIT instruction, there is
no chance to wake up the CPU unless an interrupt arrives or a store hits
the address armed by MONITOR, which is the same as idle=poll.


Yes, but you can reconfigure the VMCS/VMCB to trap on MWAIT or not trap
on it. That's something that idle=polling does not give you at all - a
guest vcpu will always use 100% CPU.


There are two things we need to figure out:
1. How and when to reconfigure the VMCS? Currently, all the knowledge is
in the guest, so we don't know when to reconfigure it. Also, we cannot
prevent the guest from using MWAIT elsewhere once it sees the feature.


2. If the guest executes MWAIT without a trap, there is no way to set a
timeout for it, so that would be a waste of CPU too.
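
For concreteness, the kind of reconfiguration Alexander describes could look
roughly like this on the VMX side (the helper is made up; the vCPU's VMCS
must be loaded on the current CPU, and deciding when to call it is exactly
the open question above):

static void set_mwait_intercept_sketch(bool intercept)
{
	u32 exec = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);

	if (intercept)
		exec |= CPU_BASED_MWAIT_EXITING;	/* guest MWAIT traps to the host */
	else
		exec &= ~CPU_BASED_MWAIT_EXITING;	/* guest MWAIT runs natively */

	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec);
}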





The only really tricky part is how to limit the effect of MONITOR on
nested page table maintenance. But if we just set the MONITOR cache size
to 4k, well behaved guests should ideally always give us the one same
page for wakeup - which we can then leave marked as trapping.





As for the timer - are you sure the problem is really the overhead of
the timer configuration, not the latency that it takes to actually fire
the guest timer?


No, the main cost is introduced by vmexits, including IPIs, timer
programming, and HLT. David detailed it at KVM Forum; you can search for
"Message Passing Workloads in KVM" on Google and the first link gives the
whole analysis of the problem.


During time critical message passing you want to keep both vCPUs inside
the guest, yes. That again is something that guest exposed MWAIT would
buy you.


I think MWAIT only helps the sibling hyper-threading case. But in real
clouds, hyper-threading is not always turned on, e.g. in most Azure
products and some Alibaba Cloud products. So it shouldn't be a big problem.




The problem is that overcommitting CPU is very expensive with anything
that does not set the guests idle at all. And not everyone can afford to
throw more CPUs at problems :).


Agreed, that's the reason why we chose dynamic halt polling. On the other
hand, the cloud vendor has the knowledge to control whether to turn it on
or not; the only problem is that there is currently no way for us to do so.





Alex



--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-07-17 Thread Yang Zhang

On 2017/7/14 17:37, Alexander Graf wrote:



On 13.07.17 13:49, Yang Zhang wrote:

On 2017/7/4 22:13, Radim Krčmář wrote:

2017-07-03 17:28+0800, Yang Zhang:

The background is that we (Alibaba Cloud) get more and more complaints
from our customers on both KVM and Xen compared to bare metal. After
investigation, the root cause is known to us: the big cost of message
passing workloads (David showed it at KVM Forum 2015).

A typical message passing workload looks like below:
vcpu 0                             vcpu 1
1. send IPI                        2.  doing hlt
3. go into idle                    4.  receive IPI and wake up from hlt
5. write APIC timer twice          6.  write APIC timer twice to
   to stop the sched timer             reprogram the sched timer


One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?


One write is to remove the timer and the other is to reprogram it.
Normally there is only one write, to remove the timer, but in some
cases it will be reprogrammed.




7. doing hlt                       8.  handle task and send IPI to vcpu 0
9. same as 4                       10. same as 3

One transaction will introduce about 12 vmexits (2 hlt and 10 MSR writes).
The cost of such vmexits degrades performance severely.


Yeah, sounds like too much ... I understood that there are

  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1

which adds to 6 MSR writes -- what are the other 4?


In the worst case, each timer will touch the APIC timer twice, so it will
add 4 additional MSR writes. But this is not always true.




 Linux kernel
already provide idle=poll to mitigate the trend. But it only
eliminates the
IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
compromise would be to turn off NOHZ kernel, but it is not the default
config for new distributions. Same for halt-poll in KVM, it only
solve the
cost from schedule in/out in host and can not help such workload much.

The purpose of this patch we want to improve current idle=poll
mechanism to


Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
get rid of the timer one.


Yes, i can try it. But MWAIT will not yield CPU, it only helps the
sibling hyperthread as you mentioned.


If you implement proper MWAIT emulation that conditionally gets en- or
disabled depending on the same halt poll dynamics that we already have
for in-host HLT handling, it will also yield the CPU.


It is hard to do. If we do not intercept the MWAIT instruction, there is
no chance to wake up the CPU unless an interrupt arrives or a store hits
the address armed by MONITOR, which is the same as idle=poll.




As for the timer - are you sure the problem is really the overhead of
the timer configuration, not the latency that it takes to actually fire
the guest timer?


No, the main cost is introduced by vmexits, including IPIs, timer
programming, and HLT. David detailed it at KVM Forum; you can search for
"Message Passing Workloads in KVM" on Google and the first link gives the
whole analysis of the problem.




One major problem I see is that we configure the host hrtimer to fire at
the point in time when the guest wants to see a timer event. But in a
virtual environment, the point in time when we have to start switching
to the VM really should be a bit *before* the guest wants to be woken
up, as it takes quite some time to switch back into the VM context.
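
A rough sketch of the advance-wakeup idea Alexander describes (the helper
and the latency constant are hypothetical, purely to illustrate the point):

#define VMENTRY_LATENCY_NS	4000	/* hypothetical cost of re-entering the VM */

static void arm_guest_timer_sketch(struct hrtimer *timer, ktime_t guest_deadline)
{
	/* Fire a bit before the guest's deadline so the switch back into
	 * guest context is already done when the interrupt is due. */
	hrtimer_start(timer, ktime_sub_ns(guest_deadline, VMENTRY_LATENCY_NS),
		      HRTIMER_MODE_ABS);
}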


Alex



--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-07-13 Thread Yang Zhang

On 2017/7/4 22:13, Radim Krčmář wrote:

2017-07-03 17:28+0800, Yang Zhang:

The background is that we (Alibaba Cloud) get more and more complaints
from our customers on both KVM and Xen compared to bare metal. After
investigation, the root cause is known to us: the big cost of message
passing workloads (David showed it at KVM Forum 2015).

A typical message passing workload looks like below:
vcpu 0                             vcpu 1
1. send IPI                        2.  doing hlt
3. go into idle                    4.  receive IPI and wake up from hlt
5. write APIC timer twice          6.  write APIC timer twice to
   to stop the sched timer             reprogram the sched timer


One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?


One write is to remove the timer and the other is to reprogram it.
Normally there is only one write, to remove the timer, but in some
cases it will be reprogrammed.





7. doing hlt                       8.  handle task and send IPI to vcpu 0
9. same as 4                       10. same as 3

One transaction will introduce about 12 vmexits (2 hlt and 10 MSR writes).
The cost of such vmexits degrades performance severely.


Yeah, sounds like too much ... I understood that there are

  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1

which adds to 6 MSR writes -- what are the other 4?


In the worst case, each timer will touch the APIC timer twice, so it will
add 4 additional MSR writes. But this is not always true.





 Linux kernel
already provide idle=poll to mitigate the trend. But it only eliminates the
IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
compromise would be to turn off NOHZ kernel, but it is not the default
config for new distributions. Same for halt-poll in KVM, it only solve the
cost from schedule in/out in host and can not help such workload much.

The purpose of this patch we want to improve current idle=poll mechanism to


Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
get rid of the timer one.


Yes, I can try it. But MWAIT will not yield the CPU; it only helps the
sibling hyperthread, as you mentioned.





use dynamic polling and do poll before touch sched timer. It should not be a
virtualization specific feature but seems bare mental have low cost to
access the MSR. So i want to only enable it in VM. Though the idea below the
patch may not so perfect to fit all conditions, it looks no worse than now.


It adds code to hot-paths (interrupt handlers) while trying to optimize
an idle-path, which is suspicious.


How about we keep current implementation and i integrate the patch to
para-virtualize part as Paolo suggested? We can continue discuss it and i
will continue to refine it if anyone has better suggestions?


I think there is a nicer solution to avoid the expensive timer rewrite:
Linux uses one-shot APIC timers and getting the timer interrupt is about
as expensive as programming the timer, so the guest can keep the timer
armed, but not re-arm it after the expiration if the CPU is idle.

This should also mitigate the problem with short idle periods, but the
optimized window is anywhere between 0 to 1ms.

Do you see disadvantages of this combined with MWAIT?

Thanks.
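
A conceptual sketch of Radim's suggestion (all names here are made up; it
only illustrates the idea of leaving the one-shot timer unarmed while idle):

static void sched_timer_expired_sketch(void)
{
	/* The one-shot APIC timer fired; if the CPU is idle, simply do not
	 * re-arm it, saving the stop/reprogram MSR writes around idle. */
	if (idle_cpu(smp_processor_id()))
		return;

	rearm_sched_timer_sketch();	/* hypothetical re-arm helper */
}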




--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-07-03 Thread Yang Zhang

On 2017/7/3 18:06, Thomas Gleixner wrote:

On Mon, 3 Jul 2017, Yang Zhang wrote:

The background is that we (Alibaba Cloud) get more and more complaints from
our customers on both KVM and Xen compared to bare metal. After investigation,
the root cause is known to us: the big cost of message passing workloads
(David showed it at KVM Forum 2015).

A typical message passing workload looks like below:
vcpu 0                             vcpu 1
1. send IPI                        2.  doing hlt
3. go into idle                    4.  receive IPI and wake up from hlt
5. write APIC timer twice          6.  write APIC timer twice to
   to stop the sched timer             reprogram the sched timer
7. doing hlt                       8.  handle task and send IPI to vcpu 0
9. same as 4                       10. same as 3

One transaction will introduce about 12 vmexits (2 hlt and 10 MSR writes).
The cost of such vmexits degrades performance severely. The Linux kernel
already provides idle=poll to mitigate the trend, but it only eliminates the
IPI and hlt vmexits; it has nothing to do with starting/stopping the sched
timer. A compromise would be to turn off the NOHZ kernel, but that is not
the default config for new distributions.


You can still turn it off on the kernel command line via nohz=off


You are right, senior users will turn it off manually. But that only
solves the sched timer part; they still have the IPI/hlt problem. Another
point is that we release the distribution image to customers without any
extra configuration, to avoid a mismatch between VM and bare metal.
Changing such a configuration needs a reboot, but some customers' business
cannot be interrupted after they start the service (like online gaming).
It would be better if we could provide a sysctl interface to allow
run-time modification. By the way, idle=poll seems too heavy to use.
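
A sketch of the run-time knob Yang is asking for, along the lines of the
(truncated) kernel/sysctl.c hunk earlier in the thread; illustrative, not a
quote of the actual patch:

static struct ctl_table poll_sysctl_sketch[] = {
	{
		.procname	= "poll_threshold_ns",
		.data		= &poll_threshold_ns,
		.maxlen		= sizeof(unsigned long),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{ }
};

With something like this, the value can be changed at run time without a
reboot, e.g. through /proc/sys/kernel/poll_threshold_ns.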






Thanks,

tglx




--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-07-03 Thread Yang Zhang

On 2017/6/27 22:22, Radim Krčmář wrote:

2017-06-27 15:56+0200, Paolo Bonzini:

On 27/06/2017 15:40, Radim Krčmář wrote:

... which is not necessarily _wrong_.  It's just a different heuristic.

Right, it's just harder to use than host's single_task_running() -- the
VCPU calling vcpu_is_preempted() is never preempted, so we have to look
at other VCPUs that are not halted, but still preempted.

If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
yield to the host.  Working under the assumption that there is work for
this PCPU if other VCPUs have stuff to do.  The downside is that it
misses information about host's topology, so it would be hard to make it
work well.


I would just use vcpu_is_preempted on the current CPU.  From guest POV
this option is really a "f*** everyone else" setting just like
idle=poll, only a little more polite.


vcpu_is_preempted() on current cpu cannot return true, AFAIK.
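
A rough sketch of the heuristic Radim outlines, purely illustrative (it
assumes the paravirtual vcpu_is_preempted() hook and ignores host topology,
which is exactly the weakness mentioned above):

static bool other_vcpus_preempted_sketch(void)
{
	int cpu, me = smp_processor_id();

	for_each_online_cpu(cpu)
		if (cpu != me && vcpu_is_preempted(cpu))
			return true;	/* someone else needs the pCPU: stop polling */

	return false;
}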


If we've been preempted and we were polling, there are two cases.  If an
interrupt was queued while the guest was preempted, the poll will be
treated as successful anyway.


I think the poll should be treated as invalid if the window has expired
while the VCPU was preempted -- the guest can't tell whether the
interrupt arrived still within the poll window (unless we added paravirt
for that), so it shouldn't be wasting time waiting for it.


   If it hasn't, let others run---but really
that's not because the guest wants to be polite, it's to avoid that the
scheduler penalizes it excessively.


This sounds like a VM entry just to do an immediate VM exit, so paravirt
seems better here as well ... (the guest telling the host about its
window -- which could also be used to rule it out as a target in the
pause loop random kick.)


So until it's preempted, I think it's okay if the guest doesn't care
about others.  You wouldn't use this option anyway in overcommitted
situations.

(I'm still not very convinced about the idea).


Me neither.  (The same mechanism is applicable to bare-metal, but was
never used there, so I would rather bring the guest behavior closer to
bare-metal.)



The background is that we (Alibaba Cloud) get more and more complaints
from our customers on both KVM and Xen compared to bare metal. After
investigation, the root cause is known to us: the big cost of message
passing workloads (David showed it at KVM Forum 2015).


A typical message passing workload looks like below:
vcpu 0                             vcpu 1
1. send IPI                        2.  doing hlt
3. go into idle                    4.  receive IPI and wake up from hlt
5. write APIC timer twice          6.  write APIC timer twice to
   to stop the sched timer             reprogram the sched timer
7. doing hlt                       8.  handle task and send IPI to vcpu 0
9. same as 4                       10. same as 3

One transaction will introduce about 12 vmexits (2 hlt and 10 MSR writes).
The cost of such vmexits degrades performance severely. The Linux kernel
already provides idle=poll to mitigate the trend, but it only eliminates
the IPI and hlt vmexits; it has nothing to do with starting/stopping the
sched timer. A compromise would be to turn off the NOHZ kernel, but that
is not the default config for new distributions. The same goes for
halt-poll in KVM: it only removes the cost of scheduling in/out in the
host and cannot help such workloads much.


The purpose of this patch is to improve the current idle=poll mechanism to
use dynamic polling, and to poll before touching the sched timer. It should
not be a virtualization-specific feature, but bare metal seems to have a
low cost for accessing the MSR, so I want to enable it only in a VM. Though
the idea behind the patch may not fit all conditions perfectly, it looks no
worse than what we have now.
How about we keep the current implementation and I integrate the patch with
the paravirtualization part as Paolo suggested? We can continue discussing
it, and I will continue to refine it if anyone has better suggestions.



--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-06-27 Thread Yang Zhang

On 2017/6/23 11:58, Yang Zhang wrote:

On 2017/6/22 19:51, Paolo Bonzini wrote:



On 22/06/2017 13:22, root wrote:

 ==

+poll_grow: (X86 only)
+
+This parameter is multiplied in the grow_poll_ns() to increase the
poll time.
+By default, the values is 2.
+
+==
+poll_shrink: (X86 only)
+
+This parameter is divided in the shrink_poll_ns() to reduce the poll
time.
+By default, the values is 2.


Even before starting the debate on whether this is a good idea or a bad
idea, KVM reduces the polling value to the minimum (10 us) by default


I noticed it. It looks like the logic inside KVM is more reasonable. I
will do more testing to compare the two.
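
For comparison, a hedged sketch of the adaptive logic being discussed, using
stand-ins for the poll_grow/poll_shrink sysctls quoted above (KVM's own
halt_poll_ns handling shrinks toward a floor when polling fails, as Paolo
notes; all names and values here are illustrative):

static unsigned long guest_poll_ns;			/* current poll window */
static unsigned int poll_grow = 2, poll_shrink = 2;	/* stand-ins for the sysctls */
static unsigned long poll_threshold_ns = 200000;	/* stand-in ceiling */

static void adjust_poll_window_sketch(bool poll_succeeded)
{
	if (poll_succeeded)
		guest_poll_ns = min_t(unsigned long,
				      guest_poll_ns * poll_grow,
				      poll_threshold_ns);
	else
		guest_poll_ns = max_t(unsigned long,
				      guest_poll_ns / poll_shrink,
				      10000);	/* ~10 us floor, like KVM */
}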


when polling fails.  Also, it shouldn't be bound to
CONFIG_HYPERVISOR_GUEST, since there's nothing specific to virtual
machines here.


Yes. The original idea was to use CONFIG_HYPERVISOR_GUEST because this
mechanism is only helpful inside a VM. But as Thomas mentioned in the other
thread, it is wrong to use it, since most distribution kernels set it to
yes and it would still affect bare metal. I will integrate it with the
paravirtualization part as you suggested below.



Regarding the good/bad idea part, KVM's polling is made much more
acceptable by single_task_running().  At least you need to integrate it
with paravirtualization.  If the VM is scheduled out, you shrink the
polling period.  There is already vcpu_is_preempted for this, it is used
by mutexes.


I have considered single_task_running() before. But since there is no such
paravirtual interface currently, and I am not sure whether introducing one
would leak information from the host, I didn't do it. Do you mean
vcpu_is_preempted() can do the same thing? I checked the code and it seems
it only tells whether the VCPU is scheduled out or not, which cannot
satisfy the need.


Hi Paolo

Can you help to clear up my confusion? I have double-checked the code but
still don't get your point. Do you think it is necessary to introduce a
paravirtual interface to expose single_task_running() to the guest?


--
Yang
Alibaba Cloud Computing


Re: [PATCH 0/2] x86/idle: add halt poll support

2017-06-23 Thread Yang Zhang

On 2017/6/23 12:35, Wanpeng Li wrote:

2017-06-23 12:08 GMT+08:00 Yang Zhang <yang.zhang...@gmail.com>:

On 2017/6/22 19:50, Wanpeng Li wrote:


2017-06-22 19:22 GMT+08:00 root <yang.zhang...@gmail.com>:


From: Yang Zhang <yang.zhang...@gmail.com>

Some latency-intensive workloads see an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM; the largest cost I have seen is
in the idle path.
This patch introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, we
don't need to go through the heavy overhead path.

Here is the data i get when running benchmark contextswitch
(https://github.com/tsuna/contextswitch)
before patch:
2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
after patch:
2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)



Did you test this after disabling the adaptive halt-polling in kvm?
What's the performance data of w/ this patchset and w/o the adaptive
halt-polling in kvm, and w/o this patchset and w/ the adaptive
halt-polling in kvm? In addition, both linux and windows guests can
get benefit as we have already done this in kvm.



I will provide more data in the next version. But it doesn't conflict with


Another case I can think of is w/ both this patchset and the adaptive
halt-polling in kvm.


the current halt polling inside kvm. This is just another enhancement.


I didn't look closely at the patchset; however, there may sometimes be
another poll in the kvm part if you fail the poll in the guest. In
addition, the adaptive halt-polling in kvm has a performance penalty
when the pCPU is heavily overcommitted even though there is a
single_task_running() check. In my testing it is hard for the guest to
know accurately whether there are other tasks waiting on the pCPU,
which makes it worse, and depending on vcpu_is_preempted() or steal
time may not be accurate or direct.

So I'm not sure how much sense adaptive halt-polling in both the guest
and kvm makes. I prefer to just keep adaptive halt-polling in kvm (then
Linux, Windows and other guests can all benefit) and avoid churning the
core x86 path.


This mechanism is not specific to KVM. It is a kernel feature which can 
benefit a guest running inside an x86 virtualization environment, whether 
that is KVM, Xen, VMware, or Hyper-V. An administrator can make KVM use 
adaptive halt polling, but cannot stop users from using halt polling 
inside the guest. Lots of users set idle=poll inside the guest to improve 
performance, which occupies even more CPU cycles. This mechanism is an 
enhancement to that, not to KVM halt polling.
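
For comparison, idle=poll today amounts to the unbounded spin below
(simplified from cpu_idle_poll() in kernel/sched/idle.c); this proposal only
bounds that spin with a threshold, it does not add anything KVM-specific:

/* Simplified view of idle=poll: spin until there is work, consuming the
 * CPU (and, inside a guest, host CPU time) for the whole idle period. */
static void idle_poll_unbounded(void)
{
        local_irq_enable();
        while (!need_resched())
                cpu_relax();
}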




Regards,
Wanpeng Li




--
Yang
Alibaba Cloud Computing


Re: [PATCH 0/2] x86/idle: add halt poll support

2017-06-22 Thread Yang Zhang

On 2017/6/22 19:50, Wanpeng Li wrote:

2017-06-22 19:22 GMT+08:00 root <yang.zhang...@gmail.com>:

From: Yang Zhang <yang.zhang...@gmail.com>

Some latency-intensive workloads see an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM; the largest cost I have seen is
in the idle path.
This patch introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, we
don't need to go through the heavy overhead path.

Here is the data i get when running benchmark contextswitch
(https://github.com/tsuna/contextswitch)
before patch:
2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
after patch:
2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)


Did you test this after disabling the adaptive halt-polling in kvm?
What's the performance data of w/ this patchset and w/o the adaptive
halt-polling in kvm, and w/o this patchset and w/ the adaptive
halt-polling in kvm? In addition, both linux and windows guests can
get benefit as we have already done this in kvm.


I will provide more data in the next version. But it doesn't conflict with 
the current halt polling inside kvm. This is just another enhancement.




Regards,
Wanpeng Li


Yang Zhang (2):
  x86/idle: add halt poll for halt idle
  x86/idle: use dynamic halt poll

 Documentation/sysctl/kernel.txt  | 24 ++
 arch/x86/include/asm/processor.h |  6 +++
 arch/x86/kernel/apic/apic.c  |  6 +++
 arch/x86/kernel/apic/vector.c|  1 +
 arch/x86/kernel/cpu/mcheck/mce_amd.c |  2 +
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  2 +
 arch/x86/kernel/cpu/mcheck/threshold.c   |  2 +
 arch/x86/kernel/irq.c|  5 ++
 arch/x86/kernel/irq_work.c   |  2 +
 arch/x86/kernel/process.c| 80 
 arch/x86/kernel/smp.c|  6 +++
 include/linux/kernel.h   |  5 ++
 kernel/sched/idle.c  |  3 ++
 kernel/sysctl.c  | 23 +
 14 files changed, 167 insertions(+)

--
1.8.3.1




--
Yang
Alibaba Cloud Computing


Re: [PATCH 1/2] x86/idle: add halt poll for halt idle

2017-06-22 Thread Yang Zhang

On 2017/6/22 22:23, Thomas Gleixner wrote:

On Thu, 22 Jun 2017, root wrote:

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,10 @@
 #include 
 #include 

+#ifdef CONFIG_HYPERVISOR_GUEST
+unsigned long poll_threshold_ns;
+#endif
+
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
  * no more per-task TSS's. The TSS size is kept cacheline-aligned
@@ -313,6 +317,23 @@ static inline void play_dead(void)
 }
 #endif

+#ifdef CONFIG_HYPERVISOR_GUEST
+void arch_cpu_idle_poll(void)
+{
+   ktime_t start, cur, stop;
+
+   if (poll_threshold_ns) {
+   start = cur = ktime_get();
+   stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
+   do {
+   if (need_resched())
+   break;
+   cur = ktime_get();
+   } while (ktime_before(cur, stop));
+   }
+}
+#endif


Aside of the whole approach being debatable, what's the reason to make this
depend on CONFIG_HYPERVISOR_GUEST and to move it into x86. If that
mechanism is worthwhile then it should go into the generic code and not
into x86. There is absolutely nothing x86 specific in that patch.

Also the CONFIG_HYPERVISOR_GUEST dependency is silly. Distro kernels ship
with CONFIG_HYPERVISOR_GUEST=y so this also gets into affect on bare metal.


You are right. As Paolo suggested, I will integrate it with the 
paravirtualization code.
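
Concretely, I am thinking of keeping the poll in generic idle code and
gating it at runtime rather than on CONFIG_HYPERVISOR_GUEST, with the guest
init path turning it on. The sketch below is only an illustration of that
direction; the init hook, the default value and the names are assumptions,
not the final patch:

/* Sketch: leave poll_threshold_ns at 0 (disabled) on bare metal and let
 * the guest init code enable it, so CONFIG_HYPERVISOR_GUEST=y distro
 * kernels are unaffected when not virtualized. */
unsigned long poll_threshold_ns __read_mostly;  /* 0 = poll disabled */

void __init guest_idle_poll_init(void)
{
        /* X86_FEATURE_HYPERVISOR is only set when running as a guest. */
        if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
                poll_threshold_ns = 10 * NSEC_PER_USEC; /* assumed default */
}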




Thanks,

tglx




--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-06-22 Thread Yang Zhang

On 2017/6/22 22:32, Thomas Gleixner wrote:

On Thu, 22 Jun 2017, root wrote:

@@ -962,6 +962,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct 
pt_regs *regs)
 * interrupt lock, which is the WrongThing (tm) to do.
 */
entering_ack_irq();
+   check_poll();


No way, that we sprinkle this function into every interrupt hotpath. There
are enough genuine ways to do that w/o touching a gazillion of files.


I will find a more appropriate place to call this function.




 #ifdef CONFIG_HYPERVISOR_GUEST
+static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
+ unsigned int max)
+{
+   unsigned int val;
+
+   /* 10us as base poll duration */
+   if (old == 0 && grow)
+   return 1;
+
+   val = old * grow;
+   if (val > max)
+   val = max;
+
+   return val;
+}
+
+static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
+{
+   if (shrink == 0)
+   return 0;
+
+   return old / shrink;
+}
+
+void check_poll(void)
+{
+   unsigned int val, poll_duration;
+   unsigned long begin_ns, now_ns;
+
+   if (!poll_threshold_ns)
+   return;


If at all then this needs to be a static key based decision.


Sure, will do it.
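
Something along these lines is what I have in mind for the static key; the
key name is a placeholder and the sysctl handler that would call
set_poll_threshold() is omitted:

/* Sketch: gate check_poll() behind a static key so the branch in the
 * interrupt hot path is patched out while the feature is disabled. */
static DEFINE_STATIC_KEY_FALSE(poll_threshold_key);

void set_poll_threshold(unsigned long ns)
{
        poll_threshold_ns = ns;
        if (ns)
                static_branch_enable(&poll_threshold_key);
        else
                static_branch_disable(&poll_threshold_key);
}

void check_poll(void)
{
        if (!static_branch_unlikely(&poll_threshold_key))
                return;
        /* ... existing grow/shrink logic from this patch ... */
}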




+
+   begin_ns = this_cpu_read(poll_begin_ns);
+   /* Not from halt state */
+   if (!begin_ns)
+   return;


If you integrate this stuff into the proper place, then the whole mess goes
away. We really do not need another facility to track idle state. We have
enough already, really.


Agreed. I will check the current code to find a more proper way to do the check.



Thanks,

tglx




--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/2] x86/idle: use dynamic halt poll

2017-06-22 Thread Yang Zhang

On 2017/6/22 19:51, Paolo Bonzini wrote:



On 22/06/2017 13:22, root wrote:

 ==

+poll_grow: (X86 only)
+
+This parameter is multiplied in the grow_poll_ns() to increase the poll time.
+By default, the value is 2.
+
+==
+poll_shrink: (X86 only)
+
+This parameter is divided in the shrink_poll_ns() to reduce the poll time.
+By default, the value is 2.


Even before starting the debate on whether this is a good idea or a bad
idea, KVM reduces the polling value to the minimum (10 us) by default


I noticed it. It looks like the logic inside KVM is more reasonable. I 
will do more testing to compare the two.



when polling fails.  Also, it shouldn't be bound to
CONFIG_HYPERVISOR_GUEST, since there's nothing specific to virtual
machines here.


Yes. The original idea was to use CONFIG_HYPERVISOR_GUEST because this 
mechanism is only helpful inside a VM. But as Thomas mentioned on the other 
thread, it is wrong to use it since most distribution kernels set it to yes, 
so it would still affect bare metal. I will integrate it with the 
paravirtualization part as you suggested below.




Regarding the good/bad idea part, KVM's polling is made much more
acceptable by single_task_running().  At least you need to integrate it
with paravirtualization.  If the VM is scheduled out, you shrink the
polling period.  There is already vcpu_is_preempted for this, it is used
by mutexes.


I have considered single_task_running() before. But since there is no 
such paravirtual interface currently, and I am not sure whether introducing 
one would leak information from the host, I didn't do it. Do you mean 
vcpu_is_preempted() can do the same thing? I checked the code and it seems 
to only tell whether the VCPU is scheduled out or not, which cannot satisfy 
the need.




Paolo




--
Yang
Alibaba Cloud Computing


Re: [PATCH 0/2] x86/idle: add halt poll support

2017-06-22 Thread Yang Zhang

On 2017/6/22 19:22, root wrote:

From: Yang Zhang <yang.zhang...@gmail.com>


Sorry for using the wrong username to send the patch; I am using a new
machine where the git config is not set up properly.



Some latency-intensive workloads see an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM; the largest cost I have seen is
in the idle path.
This patch introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, we
don't need to go through the heavy overhead path.

Here is the data i get when running benchmark contextswitch
(https://github.com/tsuna/contextswitch)
before patch:
2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
after patch:
2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)


Yang Zhang (2):
  x86/idle: add halt poll for halt idle
  x86/idle: use dynamic halt poll

 Documentation/sysctl/kernel.txt  | 24 ++
 arch/x86/include/asm/processor.h |  6 +++
 arch/x86/kernel/apic/apic.c  |  6 +++
 arch/x86/kernel/apic/vector.c|  1 +
 arch/x86/kernel/cpu/mcheck/mce_amd.c |  2 +
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  2 +
 arch/x86/kernel/cpu/mcheck/threshold.c   |  2 +
 arch/x86/kernel/irq.c|  5 ++
 arch/x86/kernel/irq_work.c   |  2 +
 arch/x86/kernel/process.c| 80 
 arch/x86/kernel/smp.c|  6 +++
 include/linux/kernel.h   |  5 ++
 kernel/sched/idle.c  |  3 ++
 kernel/sysctl.c  | 23 +
 14 files changed, 167 insertions(+)




--
Yang
Alibaba Cloud Computing


Re: [PATCH 2/3] kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection

2016-10-14 Thread Yang Zhang

On 2016/9/28 19:50, Paolo Bonzini wrote:



On 28/09/2016 13:40, Wu, Feng wrote:

IIUIC, the issue you describe above is that IPI for posted-interrupts may be
issued between

vcpu->mode = IN_GUEST_MODE;

and

local_irq_disable();

But if that really happens, we will call kvm_vcpu_kick() in
vmx_deliver_posted_interrupt(), hence the vcpu->mode will be changed
to EXITING_GUEST_MODE, then we will goto cancel_injection in
vcpu_enter_guest, so the posted-interrupt will be delivered to guest
in the next vmentry. Seems I cannot see the problem. Do I miss something?


No, if that happens kvm_trigger_posted_interrupt returns true, hence
kvm_vcpu_kick is not called.  With the fix, the IPI is processed as soon
as the guest enters non-root mode, and the interrupt is injected.


The other issue occurs when the IPI is sent between

kvm_x86_ops->hwapic_irr_update(vcpu,
kvm_lapic_find_highest_irr(vcpu));

and

vcpu->mode = IN_GUEST_MODE;

In this case, kvm_vcpu_kick is called but it (correctly) doesn't do
anything because it sees vcpu->mode == OUTSIDE_GUEST_MODE.  Then the
guest is entered with PIR.ON, but the PI interrupt is not pending and
hence the interrupt is never delivered to the guest.  The fix for this
is to move the RVI update after IN_GUEST_MODE.  Then the source CPU uses
the posted interrupt IPI instead of kvm_cpu_kick, and everything works.


Please ignore my previous reply. It seems you are already aware of the issue 
and have the resolution to fix it. :-)



--
Yang
Alibaba Cloud Computing


Re: [PATCH 3/3] KVM: x86: do not scan IRR twice on APICv vmentry

2016-10-14 Thread Yang Zhang

On 2016/9/28 5:20, Paolo Bonzini wrote:

Calling apic_find_highest_irr results in IRR being scanned twice,
once in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
vcpu_enter_guest to use sync_pir_from_irr (with a new argument to
trigger the RVI write), and let sync_pir_from_irr get the new RVI
directly from kvm_apic_update_irr.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/lapic.c| 25 -
 arch/x86/kvm/lapic.h|  4 ++--
 arch/x86/kvm/svm.c  |  2 +-
 arch/x86/kvm/vmx.c  | 25 ++---
 arch/x86/kvm/x86.c  |  7 +++
 6 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4b20f7304b9c..f73eea243567 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -941,7 +941,7 @@ struct kvm_x86_ops {
void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
-   void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
+   void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu, bool sync_rvi);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index be8b7ad56dd1..85b79b90e3d0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -317,7 +317,7 @@ static int find_highest_vector(void *bitmap)
 vec >= 0; vec -= APIC_VECTORS_PER_REG) {
reg = bitmap + REG_POS(vec);
if (*reg)
-   return fls(*reg) - 1 + vec;
+   return __fls(*reg) + vec;
}

return -1;
@@ -337,25 +337,32 @@ static u8 count_vectors(void *bitmap)
return count;
 }

-void __kvm_apic_update_irr(u32 *pir, void *regs)
+int __kvm_apic_update_irr(u32 *pir, void *regs)
 {
-   u32 i, pir_val;
+   u32 i, vec;
+   u32 pir_val, irr_val;
+   int max_irr = -1;

-   for (i = 0; i <= 7; i++) {
+   for (i = vec = 0; i <= 7; i++, vec += 32) {


How about ignoring the first 32 vectors, since they cannot be used as 
normal interrupts (except the NMI, which is handled specially)?
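
In other words, something like the variant below. This is only a sketch of
my suggestion, assuming vectors 0-31 never need to be scanned here, which is
exactly the open question:

static int __kvm_apic_update_irr_skip_low(u32 *pir, void *regs)
{
        u32 i, vec, pir_val, irr_val;
        int max_irr = -1;

        /* Start at the second 32-bit chunk so vectors 0-31 are skipped. */
        for (i = 1, vec = 32; i <= 7; i++, vec += 32) {
                pir_val = READ_ONCE(pir[i]);
                irr_val = *((u32 *)(regs + APIC_IRR + i * 0x10));
                if (pir_val) {
                        irr_val |= xchg(&pir[i], 0);
                        *((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
                }
                if (irr_val)
                        max_irr = __fls(irr_val) + vec;
        }
        return max_irr;
}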



pir_val = READ_ONCE(pir[i]);
+   irr_val = *((u32 *)(regs + APIC_IRR + i * 0x10));
if (pir_val) {
-   pir_val = xchg(&pir[i], 0);
-   *((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
+   irr_val |= xchg(&pir[i], 0);
+   *((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
}
+   if (irr_val)
+   max_irr = __fls(irr_val) + vec;
}
+
+   return max_irr;
 }
 EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);

-void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
+int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
 {
struct kvm_lapic *apic = vcpu->arch.apic;

-   __kvm_apic_update_irr(pir, apic->regs);
+   return __kvm_apic_update_irr(pir, apic->regs);
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);

@@ -376,7 +383,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic 
*apic)
return -1;

if (apic->vcpu->arch.apicv_active)
-   kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
+   kvm_x86_ops->sync_pir_to_irr(apic->vcpu, false);
result = apic_search_irr(apic);


You already get max_irr from sync_pir_to_irr. It seems the 
apic_search_irr() call is redundant here.



ASSERT(result == -1 || result >= 16);


I think the result must be >= 32, but it seems this code has existed since 
KVM was first merged into the kernel.




diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index f60d01c29d51..fc72828c3782 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -70,8 +70,8 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, 
int len,
 bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
   int short_hand, unsigned int dest, int dest_mode);

-void __kvm_apic_update_irr(u32 *pir, void *regs);
-void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
+int __kvm_apic_update_irr(u32 *pir, void *regs);
+int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
 struct dest_map *dest_map);
 int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9b4221471e3d..60e53fa2b554 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4383,7 +4383,7 @@ static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, 
u64 *eoi_exit_bitmap)

Re: [PATCH 2/3] kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection

2016-10-14 Thread Yang Zhang

On 2016/9/28 5:20, Paolo Bonzini wrote:

Since bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU
is blocked", 2015-09-18) the posted interrupt descriptor is checked
unconditionally for PIR.ON.  Therefore we don't need KVM_REQ_EVENT to
trigger the scan and, if NMIs or SMIs are not involved, we can avoid
the complicated event injection path.

However, there is a race between vmx_deliver_posted_interrupt and
vcpu_enter_guest.  Fix it by disabling interrupts before vcpu->mode is
set to IN_GUEST_MODE.

Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
there since APICv was introduced.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/lapic.c | 2 --
 arch/x86/kvm/vmx.c   | 8 +---
 arch/x86/kvm/x86.c   | 9 +++--
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 63a442aefc12..be8b7ad56dd1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -356,8 +356,6 @@ void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
struct kvm_lapic *apic = vcpu->arch.apic;

__kvm_apic_update_irr(pir, apic->regs);
-
-   kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b33eee395b00..207b9aa32915 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4844,9 +4844,11 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu 
*vcpu, int vector)
if (pi_test_and_set_pir(vector, &vmx->pi_desc))
return;

-   r = pi_test_and_set_on(&vmx->pi_desc);
-   kvm_make_request(KVM_REQ_EVENT, vcpu);


Hi Paolo

I remember that a request is necessary before vcpu kick. Otherwise, the 
interrupt cannot be serviced in time. In this case, if the posted 
interrupt delivery occurs between a and c:


vcpu_enter_guest:
a. check pending interrupt
b. an interrupt is delivered from another 
vcpu (vmx_deliver_posted_interrupt() is called)

c.vcpu->mode = IN_GUEST_MODE;

Previously, the vcpu would become aware of the pending request (interrupt) 
after c:

if (vcpu->request)
goto cancel_injection

With this patch there is no request, so the vcpu will continue to enter the 
guest without handling the pending interrupt (kvm_vcpu_kick does nothing 
since the mode isn't equal to IN_GUEST_MODE).


Can this case happen?


-   if (r || !kvm_vcpu_trigger_posted_interrupt(vcpu))
+   /* If a previous notification has sent the IPI, nothing to do.  */
+   if (pi_test_and_set_on(&vmx->pi_desc))
+   return;
+
+   if (!kvm_vcpu_trigger_posted_interrupt(vcpu))
kvm_vcpu_kick(vcpu);
 }

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ee8a91a78c3..604cfbfc5bee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6658,6 +6658,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_x86_ops->prepare_guest_switch(vcpu);
if (vcpu->fpu_active)
kvm_load_guest_fpu(vcpu);
+
+   /*
+* Disable IRQs before setting IN_GUEST_MODE, so that
+* posted interrupts with vcpu->mode == IN_GUEST_MODE
+* always result in virtual interrupt delivery.
+*/
+   local_irq_disable();
vcpu->mode = IN_GUEST_MODE;

srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
@@ -6671,8 +6678,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 */
smp_mb__after_srcu_read_unlock();

-   local_irq_disable();
-
if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
|| need_resched() || signal_pending(current)) {
vcpu->mode = OUTSIDE_GUEST_MODE;




--
Yang
Alibaba Cloud Computing


Re: [PATCH 1/2] KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC

2016-08-10 Thread Yang Zhang

On 2016/8/9 18:19, Wincy Van wrote:

On Tue, Aug 9, 2016 at 5:32 PM, Yang Zhang  wrote:

On 2016/8/9 2:16, Radim Krčmář wrote:


msr bitmap can be used to avoid a VM exit (interception) on guest MSR
accesses.  In some configurations of VMX controls, the guest can even
directly access host's x2APIC MSRs.  See SDM 29.5 VIRTUALIZING MSR-BASED
APIC ACCESSES.

L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
To do so, L1 would first trick KVM to disable all possible interceptions
by enabling APICv features and then would turn those features off;
nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
not intercept previously enabled MSRs even though they were not safe
with the new configuration.

Correctly re-enabling interceptions is not enough as a second bug would
still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
VMCSs, so L1 could trigger a race to get the desired combination of msr
bitmap and VMX controls.

This fix allocates a msr bitmap for every L1 VCPU, allows only safe
x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
have to intercept everything anyway.

Fixes: 3af18d9c5fe9 ("KVM: nVMX: Prepare for using hardware MSR bitmap")
Reported-by: Jim Mattson 
Suggested-by: Wincy Van 
Signed-off-by: Radim Krčmář 
---
 arch/x86/kvm/vmx.c | 107
++---
 1 file changed, 44 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a45d8580f91e..c66ac2c70d22 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -435,6 +435,8 @@ struct nested_vmx {
bool pi_pending;
u16 posted_intr_nv;

+   unsigned long *msr_bitmap;
+
struct hrtimer preemption_timer;
bool preemption_timer_expired;

@@ -924,7 +926,6 @@ static unsigned long *vmx_msr_bitmap_legacy;
 static unsigned long *vmx_msr_bitmap_longmode;
 static unsigned long *vmx_msr_bitmap_legacy_x2apic;
 static unsigned long *vmx_msr_bitmap_longmode_x2apic;
-static unsigned long *vmx_msr_bitmap_nested;
 static unsigned long *vmx_vmread_bitmap;
 static unsigned long *vmx_vmwrite_bitmap;

@@ -2508,7 +2509,7 @@ static void vmx_set_msr_bitmap(struct kvm_vcpu
*vcpu)
unsigned long *msr_bitmap;

if (is_guest_mode(vcpu))
-   msr_bitmap = vmx_msr_bitmap_nested;
+   msr_bitmap = to_vmx(vcpu)->nested.msr_bitmap;
else if (cpu_has_secondary_exec_ctrls() &&
 (vmcs_read32(SECONDARY_VM_EXEC_CONTROL) &
  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
@@ -6363,13 +6364,6 @@ static __init int hardware_setup(void)
if (!vmx_msr_bitmap_longmode_x2apic)
goto out4;

-   if (nested) {
-   vmx_msr_bitmap_nested =
-   (unsigned long *)__get_free_page(GFP_KERNEL);
-   if (!vmx_msr_bitmap_nested)
-   goto out5;
-   }
-
vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_vmread_bitmap)
goto out6;
@@ -6392,8 +6386,6 @@ static __init int hardware_setup(void)

memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE);
memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE);
-   if (nested)
-   memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);

if (setup_vmcs_config(&vmcs_config) < 0) {
r = -EIO;
@@ -6529,9 +6521,6 @@ out8:
 out7:
free_page((unsigned long)vmx_vmread_bitmap);
 out6:
-   if (nested)
-   free_page((unsigned long)vmx_msr_bitmap_nested);
-out5:
free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic);
 out4:
free_page((unsigned long)vmx_msr_bitmap_longmode);
@@ -6557,8 +6546,6 @@ static __exit void hardware_unsetup(void)
free_page((unsigned long)vmx_io_bitmap_a);
free_page((unsigned long)vmx_vmwrite_bitmap);
free_page((unsigned long)vmx_vmread_bitmap);
-   if (nested)
-   free_page((unsigned long)vmx_msr_bitmap_nested);

free_kvm_area();
 }
@@ -6995,16 +6982,21 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}

+   if (cpu_has_vmx_msr_bitmap()) {
+   vmx->nested.msr_bitmap =
+   (unsigned long
*)__get_free_page(GFP_KERNEL);
+   if (!vmx->nested.msr_bitmap)
+   goto out_msr_bitmap;
+   }
+



We export msr_bitmap to the guest even when it is not supported by hardware,
so we need to allocate msr_bitmap for L1 unconditionally.




Nice to see you, Yang :-)


Nice to see you too.



I think vmx->nested.msr_bitmap is filled by L0 and L1's settings, and
used by hardware.
L1's msr_bitmap is managed by itself, and L1 will write MSR_BITMAP field to
vmcs12->msr_bitmap, then L0 can get L1's settings to emulate MSR_BITMAP.

Am I missing something?


You are correct!



Thanks,
Wincy


vmx->nested.cach

Re: [PATCH 1/2] KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC

2016-08-10 Thread Yang Zhang

On 2016/8/9 20:23, Radim Krčmář wrote:

2016-08-09 17:32+0800, Yang Zhang:

On 2016/8/9 2:16, Radim Krčmář wrote:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
@@ -6995,16 +6982,21 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}

+   if (cpu_has_vmx_msr_bitmap()) {
+   vmx->nested.msr_bitmap =
+   (unsigned long *)__get_free_page(GFP_KERNEL);
+   if (!vmx->nested.msr_bitmap)
+   goto out_msr_bitmap;
+   }
+


We export msr_bitmap to the guest even when it is not supported by hardware,
so we need to allocate msr_bitmap for L1 unconditionally.


We do emulate the feature, but the vmx->nested.msr_bitmap is used only
if VMX supports it to avoid some VM exits:


@@ -9957,10 +9938,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
}

if (cpu_has_vmx_msr_bitmap() &&
-   exec_control & CPU_BASED_USE_MSR_BITMAPS) {
-   nested_vmx_merge_msr_bitmap(vcpu, vmcs12);
-   /* MSR_BITMAP will be set by following vmx_set_efer. */
-   } else
+   exec_control & CPU_BASED_USE_MSR_BITMAPS &&
+   nested_vmx_merge_msr_bitmap(vcpu, vmcs12))
+   ; /* MSR_BITMAP will be set by following vmx_set_efer. */
+   else
exec_control &= ~CPU_BASED_USE_MSR_BITMAPS;


The else branch is taken if !cpu_has_vmx_msr_bitmap() and disables msr
bitmaps.  Similar check for vmx_set_msr_bitmap(), so the NULL doesn't
even get written to VMCS.

KVM always uses L1's msr bitmaps when emulating the feature.



Yes, you are right. I forgot the bitmap is only used by hardware. :(

--
Yang
Alibaba Cloud Computing


Re: [PATCH 1/2] KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC

2016-08-09 Thread Yang Zhang

On 2016/8/9 2:16, Radim Krčmář wrote:

msr bitmap can be used to avoid a VM exit (interception) on guest MSR
accesses.  In some configurations of VMX controls, the guest can even
directly access host's x2APIC MSRs.  See SDM 29.5 VIRTUALIZING MSR-BASED
APIC ACCESSES.

L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
To do so, L1 would first trick KVM to disable all possible interceptions
by enabling APICv features and then would turn those features off;
nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
not intercept previously enabled MSRs even though they were not safe
with the new configuration.

Correctly re-enabling interceptions is not enough as a second bug would
still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
VMCSs, so L1 could trigger a race to get the desired combination of msr
bitmap and VMX controls.

This fix allocates a msr bitmap for every L1 VCPU, allows only safe
x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
have to intercept everything anyway.

Fixes: 3af18d9c5fe9 ("KVM: nVMX: Prepare for using hardware MSR bitmap")
Reported-by: Jim Mattson 
Suggested-by: Wincy Van 
Signed-off-by: Radim Krčmář 
---
 arch/x86/kvm/vmx.c | 107 ++---
 1 file changed, 44 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a45d8580f91e..c66ac2c70d22 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -435,6 +435,8 @@ struct nested_vmx {
bool pi_pending;
u16 posted_intr_nv;

+   unsigned long *msr_bitmap;
+
struct hrtimer preemption_timer;
bool preemption_timer_expired;

@@ -924,7 +926,6 @@ static unsigned long *vmx_msr_bitmap_legacy;
 static unsigned long *vmx_msr_bitmap_longmode;
 static unsigned long *vmx_msr_bitmap_legacy_x2apic;
 static unsigned long *vmx_msr_bitmap_longmode_x2apic;
-static unsigned long *vmx_msr_bitmap_nested;
 static unsigned long *vmx_vmread_bitmap;
 static unsigned long *vmx_vmwrite_bitmap;

@@ -2508,7 +2509,7 @@ static void vmx_set_msr_bitmap(struct kvm_vcpu *vcpu)
unsigned long *msr_bitmap;

if (is_guest_mode(vcpu))
-   msr_bitmap = vmx_msr_bitmap_nested;
+   msr_bitmap = to_vmx(vcpu)->nested.msr_bitmap;
else if (cpu_has_secondary_exec_ctrls() &&
 (vmcs_read32(SECONDARY_VM_EXEC_CONTROL) &
  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
@@ -6363,13 +6364,6 @@ static __init int hardware_setup(void)
if (!vmx_msr_bitmap_longmode_x2apic)
goto out4;

-   if (nested) {
-   vmx_msr_bitmap_nested =
-   (unsigned long *)__get_free_page(GFP_KERNEL);
-   if (!vmx_msr_bitmap_nested)
-   goto out5;
-   }
-
vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_vmread_bitmap)
goto out6;
@@ -6392,8 +6386,6 @@ static __init int hardware_setup(void)

memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE);
memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE);
-   if (nested)
-   memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);

if (setup_vmcs_config(_config) < 0) {
r = -EIO;
@@ -6529,9 +6521,6 @@ out8:
 out7:
free_page((unsigned long)vmx_vmread_bitmap);
 out6:
-   if (nested)
-   free_page((unsigned long)vmx_msr_bitmap_nested);
-out5:
free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic);
 out4:
free_page((unsigned long)vmx_msr_bitmap_longmode);
@@ -6557,8 +6546,6 @@ static __exit void hardware_unsetup(void)
free_page((unsigned long)vmx_io_bitmap_a);
free_page((unsigned long)vmx_vmwrite_bitmap);
free_page((unsigned long)vmx_vmread_bitmap);
-   if (nested)
-   free_page((unsigned long)vmx_msr_bitmap_nested);

free_kvm_area();
 }
@@ -6995,16 +6982,21 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}

+   if (cpu_has_vmx_msr_bitmap()) {
+   vmx->nested.msr_bitmap =
+   (unsigned long *)__get_free_page(GFP_KERNEL);
+   if (!vmx->nested.msr_bitmap)
+   goto out_msr_bitmap;
+   }
+


We export msr_bitmap to guest even it is not supported by hardware. So 
we need to allocate msr_bitmap for L1 unconditionally.



vmx->nested.cached_vmcs12 = kmalloc(VMCS12_SIZE, GFP_KERNEL);
if (!vmx->nested.cached_vmcs12)
-   return -ENOMEM;
+   goto out_cached_vmcs12;

if (enable_shadow_vmcs) {
shadow_vmcs = alloc_vmcs();
-   if (!shadow_vmcs) {
-   kfree(vmx->nested.cached_vmcs12);
-   return -ENOMEM;
-   }
+   if (!shadow_vmcs)
+ 

Re: [PATCH 1/2] KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC

2016-08-09 Thread Yang Zhang

On 2016/8/9 2:16, Radim Krčmář wrote:

msr bitmap can be used to avoid a VM exit (interception) on guest MSR
accesses.  In some configurations of VMX controls, the guest can even
directly access host's x2APIC MSRs.  See SDM 29.5 VIRTUALIZING MSR-BASED
APIC ACCESSES.

L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
To do so, L1 would first trick KVM to disable all possible interceptions
by enabling APICv features and then would turn those features off;
nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
not intercept previously enabled MSRs even though they were not safe
with the new configuration.

Correctly re-enabling interceptions is not enough as a second bug would
still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
VMCSs, so L1 could trigger a race to get the desired combination of msr
bitmap and VMX controls.

This fix allocates a msr bitmap for every L1 VCPU, allows only safe
x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
have to intercept everything anyway.

Fixes: 3af18d9c5fe9 ("KVM: nVMX: Prepare for using hardware MSR bitmap")
Reported-by: Jim Mattson 
Suggested-by: Wincy Van 
Signed-off-by: Radim Krčmář 
---
 arch/x86/kvm/vmx.c | 107 ++---
 1 file changed, 44 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a45d8580f91e..c66ac2c70d22 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -435,6 +435,8 @@ struct nested_vmx {
bool pi_pending;
u16 posted_intr_nv;

+   unsigned long *msr_bitmap;
+
struct hrtimer preemption_timer;
bool preemption_timer_expired;

@@ -924,7 +926,6 @@ static unsigned long *vmx_msr_bitmap_legacy;
 static unsigned long *vmx_msr_bitmap_longmode;
 static unsigned long *vmx_msr_bitmap_legacy_x2apic;
 static unsigned long *vmx_msr_bitmap_longmode_x2apic;
-static unsigned long *vmx_msr_bitmap_nested;
 static unsigned long *vmx_vmread_bitmap;
 static unsigned long *vmx_vmwrite_bitmap;

@@ -2508,7 +2509,7 @@ static void vmx_set_msr_bitmap(struct kvm_vcpu *vcpu)
unsigned long *msr_bitmap;

if (is_guest_mode(vcpu))
-   msr_bitmap = vmx_msr_bitmap_nested;
+   msr_bitmap = to_vmx(vcpu)->nested.msr_bitmap;
else if (cpu_has_secondary_exec_ctrls() &&
 (vmcs_read32(SECONDARY_VM_EXEC_CONTROL) &
  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
@@ -6363,13 +6364,6 @@ static __init int hardware_setup(void)
if (!vmx_msr_bitmap_longmode_x2apic)
goto out4;

-   if (nested) {
-   vmx_msr_bitmap_nested =
-   (unsigned long *)__get_free_page(GFP_KERNEL);
-   if (!vmx_msr_bitmap_nested)
-   goto out5;
-   }
-
vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_vmread_bitmap)
goto out6;
@@ -6392,8 +6386,6 @@ static __init int hardware_setup(void)

memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE);
memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE);
-   if (nested)
-   memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);

if (setup_vmcs_config(&vmcs_config) < 0) {
r = -EIO;
@@ -6529,9 +6521,6 @@ out8:
 out7:
free_page((unsigned long)vmx_vmread_bitmap);
 out6:
-   if (nested)
-   free_page((unsigned long)vmx_msr_bitmap_nested);
-out5:
free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic);
 out4:
free_page((unsigned long)vmx_msr_bitmap_longmode);
@@ -6557,8 +6546,6 @@ static __exit void hardware_unsetup(void)
free_page((unsigned long)vmx_io_bitmap_a);
free_page((unsigned long)vmx_vmwrite_bitmap);
free_page((unsigned long)vmx_vmread_bitmap);
-   if (nested)
-   free_page((unsigned long)vmx_msr_bitmap_nested);

free_kvm_area();
 }
@@ -6995,16 +6982,21 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}

+   if (cpu_has_vmx_msr_bitmap()) {
+   vmx->nested.msr_bitmap =
+   (unsigned long *)__get_free_page(GFP_KERNEL);
+   if (!vmx->nested.msr_bitmap)
+   goto out_msr_bitmap;
+   }
+


We export msr_bitmap to the guest even if it is not supported by hardware,
so we need to allocate msr_bitmap for L1 unconditionally.



vmx->nested.cached_vmcs12 = kmalloc(VMCS12_SIZE, GFP_KERNEL);
if (!vmx->nested.cached_vmcs12)
-   return -ENOMEM;
+   goto out_cached_vmcs12;

if (enable_shadow_vmcs) {
shadow_vmcs = alloc_vmcs();
-   if (!shadow_vmcs) {
-   kfree(vmx->nested.cached_vmcs12);
-   return -ENOMEM;
-   }
+   if (!shadow_vmcs)
+   goto out_shadow_vmcs;
/* mark vmcs as 
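
The quoted hunk is cut off by the archive. The shape of the change is the
usual goto-based unwinding, roughly as in the stand-alone sketch below
(names, sizes and helpers are stand-ins, not the kernel's):

#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct nested_state {
	unsigned long *msr_bitmap;
	void *cached_vmcs12;
	void *shadow_vmcs;
};

/* Allocate in order; on failure jump to a label that frees what came before. */
static int nested_alloc(struct nested_state *n, int has_msr_bitmap,
			int enable_shadow_vmcs)
{
	memset(n, 0, sizeof(*n));

	if (has_msr_bitmap) {
		n->msr_bitmap = calloc(1, PAGE_SIZE);
		if (!n->msr_bitmap)
			goto out_msr_bitmap;
	}

	n->cached_vmcs12 = malloc(PAGE_SIZE);
	if (!n->cached_vmcs12)
		goto out_cached_vmcs12;

	if (enable_shadow_vmcs) {
		n->shadow_vmcs = malloc(PAGE_SIZE);
		if (!n->shadow_vmcs)
			goto out_shadow_vmcs;
	}
	return 0;

out_shadow_vmcs:
	free(n->cached_vmcs12);
out_cached_vmcs12:
	free(n->msr_bitmap);
out_msr_bitmap:
	return -ENOMEM;
}

int main(void)
{
	struct nested_state n;
	int ret = nested_alloc(&n, 1, 1);

	free(n.shadow_vmcs);
	free(n.cached_vmcs12);
	free(n.msr_bitmap);
	return ret ? 1 : 0;
}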

Re: [RFC PATCH 4/4] KVM: vmx: add support for emulating UMIP

2016-07-13 Thread Yang Zhang

On 2016/7/13 17:35, Paolo Bonzini wrote:



On 13/07/2016 11:21, Yang Zhang wrote:


+static int handle_desc(struct kvm_vcpu *vcpu)
+{
+WARN_ON(!(vcpu->arch.cr4 & X86_CR4_UMIP));


I think WARN_ON is too heavy since a malicious guest may trigger it at will.


I missed this---how so?  Setting the bit is under "if ((cr4 &
X86_CR4_UMIP) && !boot_cpu_has(X86_FEATURE_UMIP))".


Sorry, I was considering it under my previous suggestion (setting it 
unconditionally). :(


--
Yang
Alibaba Cloud Computing
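
For the "a malicious guest may trigger it at will" concern, the usual
mitigation is to limit such a diagnostic to a single report, in the spirit of
the kernel's WARN_ON_ONCE(); the snippet below is only a generic sketch of
that idea, not what the patch does:

#include <stdio.h>

/* Report at most once, so a guest that can reach this path cannot flood the log. */
#define WARN_ONCE_IF(cond, msg)					\
	do {							\
		static int warned;				\
		if ((cond) && !warned) {			\
			warned = 1;				\
			fprintf(stderr, "warning: %s\n", msg);	\
		}						\
	} while (0)

int main(void)
{
	for (int i = 0; i < 3; i++)
		WARN_ONCE_IF(1, "CR4.UMIP not set in the descriptor-table exit handler");
	return 0;
}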


Re: [RFC PATCH 4/4] KVM: vmx: add support for emulating UMIP

2016-07-13 Thread Yang Zhang

On 2016/7/13 17:35, Paolo Bonzini wrote:



On 13/07/2016 11:21, Yang Zhang wrote:


+if ((cr4 & X86_CR4_UMIP) && !boot_cpu_has(X86_FEATURE_UMIP)) {
+vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
+  SECONDARY_EXEC_DESC);
+hw_cr4 &= ~X86_CR4_UMIP;
+} else
+vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
+SECONDARY_EXEC_DESC);
+


Since faults based on privilege level have priority over VM exits,
we don't need to enable/disable SECONDARY_EXEC_DESC dynamically.
Instead, we can set it unconditionally.


I'm setting it dynamically because it slows down LGDT, LLDT, LIDT and LTR.


You are right. And SGDT, SIDT, SLDT, SMSW and STR will also be intercepted 
even in CPL 0 if we don't disable it.


--
Yang
Alibaba Cloud Computing
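
The trade-off being discussed is that descriptor-table exiting also intercepts
the (rare) ring-0 uses, which is why the control is toggled per CR4 write. A
stand-alone sketch of that decision, with stub helpers and the constants taken
from the quoted patch and the SDM:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define X86_CR4_UMIP			(1ull << 11)	/* CR4.UMIP is bit 11 */
#define SECONDARY_EXEC_DESC		0x00000004u	/* from the quoted patch */
#define SECONDARY_VM_EXEC_CONTROL	0x401eu		/* stand-in field id */

static bool host_has_umip(void) { return false; }	/* stub */
static void vmcs_set_bits(uint32_t field, uint32_t mask)   { (void)field; (void)mask; }	/* stub */
static void vmcs_clear_bits(uint32_t field, uint32_t mask) { (void)field; (void)mask; }	/* stub */

/* Emulate UMIP via descriptor-table exits only when the guest wants UMIP
 * and the host CPU cannot provide it; hide the bit from the hardware CR4. */
static uint64_t adjust_cr4_for_umip(uint64_t cr4, uint64_t hw_cr4)
{
	if ((cr4 & X86_CR4_UMIP) && !host_has_umip()) {
		vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_DESC);
		hw_cr4 &= ~X86_CR4_UMIP;
	} else {
		vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_DESC);
	}
	return hw_cr4;
}

int main(void)
{
	printf("hw_cr4 = %#llx\n", (unsigned long long)
	       adjust_cr4_for_umip(X86_CR4_UMIP, X86_CR4_UMIP));
	return 0;
}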


Re: [RFC PATCH 4/4] KVM: vmx: add support for emulating UMIP

2016-07-13 Thread Yang Zhang

On 2016/7/13 3:20, Paolo Bonzini wrote:

UMIP (User-Mode Instruction Prevention) is a feature of future
Intel processors (Cannonlake?) that blocks SLDT, SGDT, STR, SIDT
and SMSW from user-mode processes.

On Intel systems it's *almost* possible to emulate it; it slows
down the instructions when they're executed in ring 0, but they
are really never executed in practice.  The catch is that SMSW
doesn't cause a vmexit, and hence SMSW will not fault.

When UMIP is enabled but not supported by the host, descriptor table
exits are enabled, and the emulator takes care of injecting a #GP when
any of SLDT, SGDT, STR, SIDT are encountered.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/include/asm/vmx.h  |  1 +
 arch/x86/include/uapi/asm/vmx.h |  4 
 arch/x86/kvm/vmx.c  | 36 ++--
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 14c63c7e8337..8b95f0ec0190 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -60,6 +60,7 @@
  */
 #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x0001
 #define SECONDARY_EXEC_ENABLE_EPT   0x0002
+#define SECONDARY_EXEC_DESC0x0004
 #define SECONDARY_EXEC_RDTSCP  0x0008
 #define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x0010
 #define SECONDARY_EXEC_ENABLE_VPID  0x0020
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 5b15d94a33f8..2f9a69b9559c 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,8 @@
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS 44
 #define EXIT_REASON_EOI_INDUCED 45
+#define EXIT_REASON_GDTR_IDTR   46
+#define EXIT_REASON_LDTR_TR 47
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_INVEPT  50
@@ -114,6 +116,8 @@
{ EXIT_REASON_MCE_DURING_VMENTRY,"MCE_DURING_VMENTRY" }, \
{ EXIT_REASON_TPR_BELOW_THRESHOLD,   "TPR_BELOW_THRESHOLD" }, \
{ EXIT_REASON_APIC_ACCESS,   "APIC_ACCESS" }, \
+   { EXIT_REASON_GDTR_IDTR, "GDTR_IDTR" }, \
+   { EXIT_REASON_LDTR_TR,   "LDTR_TR" }, \
{ EXIT_REASON_EPT_VIOLATION, "EPT_VIOLATION" }, \
{ EXIT_REASON_EPT_MISCONFIG, "EPT_MISCONFIG" }, \
{ EXIT_REASON_INVEPT,"INVEPT" }, \
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b133fc3d6a7d..0706e363c5e3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3342,6 +3342,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_ENABLE_EPT |
SECONDARY_EXEC_UNRESTRICTED_GUEST |
SECONDARY_EXEC_PAUSE_LOOP_EXITING |
+   SECONDARY_EXEC_DESC |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_ENABLE_INVPCID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
@@ -3967,6 +3968,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
(to_vmx(vcpu)->rmode.vm86_active ?
 KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);

+   if ((cr4 & X86_CR4_UMIP) && !boot_cpu_has(X86_FEATURE_UMIP)) {
+   vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
+ SECONDARY_EXEC_DESC);
+   hw_cr4 &= ~X86_CR4_UMIP;
+   } else
+   vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
+   SECONDARY_EXEC_DESC);
+


Since faults based on privilege level have priority over VM exits, 
we don't need to enable/disable SECONDARY_EXEC_DESC dynamically. 
Instead, we can set it unconditionally.



if (cr4 & X86_CR4_VMXE) {
/*
 * To use VMXON (and later other VMX instructions), a guest
@@ -4913,6 +4922,11 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 {
u32 exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;
+
+   /* SECONDARY_EXEC_DESC is enabled/disabled on writes to CR4.UMIP,
+* in vmx_set_cr4.  */
+   exec_control &= ~SECONDARY_EXEC_DESC;
+
if (!cpu_need_virtualize_apic_accesses(&vmx->vcpu))
exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
if (vmx->vpid == 0)
@@ -5649,6 +5663,12 @@ static void handle_clts(struct kvm_vcpu *vcpu)
vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
 }

+static int handle_desc(struct kvm_vcpu *vcpu)
+{
+   WARN_ON(!(vcpu->arch.cr4 & X86_CR4_UMIP));


I think WARN_ON is too heavy since a malicious guest may trigger it at will.


+   return emulate_instruction(vcpu, 0) == EMULATE_DONE;
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
  

Re: [RFC PATCH 0/4] KVM: Emulate UMIP (or almost do so)

2016-07-13 Thread Yang Zhang

On 2016/7/13 3:20, Paolo Bonzini wrote:

UMIP (User-Mode Instruction Prevention) is a feature of future
Intel processors (Cannonlake?) that blocks SLDT, SGDT, STR, SIDT


I remember there is no Cannonlake any more. It should be Icelake. :)


and SMSW from user-mode processes.


Do you know the background of this feature? Is it for security or some other purpose?



The idea here is to use virtualization intercepts to emulate UMIP; it
slows down the instructions when they're executed in ring 0, but they
are really never executed in practice.  On AMD systems it's possible
to emulate it entirely; instead on Intel systems it's *almost* possible
to emulate it, because SMSW doesn't cause a vmexit, and hence SMSW will
not fault.

This patch series provides the infrastructure and implements it on
Intel.  I tested it through kvm-unit-tests.

Still I think the idea is interesting, even if it's buggy for current
Intel processors.  Any opinions?

Paolo

Paolo Bonzini (4):
  x86: add UMIP feature and CR4 bit
  KVM: x86: emulate sldt and str
  KVM: x86: add support for emulating UMIP
  KVM: vmx: add support for emulating UMIP

 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/include/asm/vmx.h  |  1 +
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/include/uapi/asm/vmx.h |  4 +++
 arch/x86/kvm/cpuid.c|  5 +++-
 arch/x86/kvm/cpuid.h|  8 ++
 arch/x86/kvm/emulate.c  | 40 -
 arch/x86/kvm/svm.c  |  6 +
 arch/x86/kvm/vmx.c  | 40 -
 arch/x86/kvm/x86.c  |  3 +++
 11 files changed, 104 insertions(+), 9 deletions(-)




--
Yang
Alibaba Cloud Computing
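
The rule the emulator is expected to enforce on a descriptor-table exit can be
sketched as below (behavioral sketch only, not the emulator code); SMSW is
listed for the architectural rule, but, as said above, it never exits on Intel
and so cannot be caught this way:

#include <stdbool.h>
#include <stdio.h>

#define X86_CR4_UMIP (1ull << 11)

enum insn { INSN_SGDT, INSN_SIDT, INSN_SLDT, INSN_STR, INSN_SMSW, INSN_OTHER };

/* With CR4.UMIP set, the listed instructions take #GP(0) outside ring 0. */
static bool umip_should_gp(unsigned long long cr4, int cpl, enum insn i)
{
	if (!(cr4 & X86_CR4_UMIP) || cpl == 0)
		return false;
	switch (i) {
	case INSN_SGDT: case INSN_SIDT: case INSN_SLDT:
	case INSN_STR:  case INSN_SMSW:
		return true;
	default:
		return false;
	}
}

int main(void)
{
	printf("SGDT at CPL 3 with UMIP -> %s\n",
	       umip_should_gp(X86_CR4_UMIP, 3, INSN_SGDT) ? "#GP(0)" : "allowed");
	printf("SGDT at CPL 0 with UMIP -> %s\n",
	       umip_should_gp(X86_CR4_UMIP, 0, INSN_SGDT) ? "#GP(0)" : "allowed");
	return 0;
}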


Re: [PATCH v2 04/13] KVM: x86: dynamic kvm_apic_map

2016-07-11 Thread Yang Zhang

On 2016/7/11 23:52, Radim Krčmář wrote:

2016-07-11 16:14+0200, Paolo Bonzini:

On 11/07/2016 15:48, Radim Krčmář wrote:

I guess the easiest solution is to replace kvm_apic_id with a field in
struct kvm_lapic, which is already shifted right by 24 in xAPIC mode.


(I guess the fewest LOC is to look at vcpu->vcpu_id, which is equal to
 x2apic id.  xapic id cannot be greater than 255 and all of those are
 covered by the initial value of max_id.)


Yes, this would work too.  Or even better perhaps, look at vcpu->vcpu_id
in kvm_apic_id?


APIC ID is writeable in xAPIC mode, which would make the implementation
weird without an extra variable.  Always read-only APIC ID would be
best, IMO.


Or we can simply move the assignment of apic_base to the end.


Yes, this would work, I'd also remove recalculates from
kvm_apic_set_*apic_id() and add a compiler barrier with comment for good
measure, even though set_virtual_x2apic_mode() serves as one.


Why a compiler barrier?


True, it should be a proper pair of smp_wmb() and smp_rmb() in
recalculate ... and current kvm_apic_id() reads in a wrong order, so
changing the apic_base alone update wouldn't get rid of this race.


(What makes me a bit wary is that it doesn't avoid the same problem if we
 changed KVM to reset the apic id to the xapic id first when disabling the apic.)


Yes, this is why I prefer it fixed once and for all in kvm_apic_id...


Seems most reasonable.  We'll need to be careful to have a correct value
in the apic page, but there shouldn't be any races there.


Yes, it is more reasonable.




Races in recalculation and APIC ID changes also lead to invalid physical
maps, which haven't been taken care of properly ...


Hmm, true, but can be fixed separately.  Probably the mutex should be
renamed so that it can be taken outside recalculate_apic_map...


Good point, it'll make reasoning easier and shouldn't introduce any
extra scalability issues.


If we can ensure all the updates to LDR, DFR, ID and APIC mode are done in 
the correct sequence and followed by an APIC map recalculation, it should be 
enough. It's the guest's responsibility to ensure the APIC update happens at 
the right time (i.e. no interrupt is in flight); otherwise the interrupt may 
be delivered to the wrong VCPU.


--
best regards
yang
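
A generic illustration of the smp_wmb()/smp_rmb() pairing under discussion,
with C11 release/acquire standing in for the kernel barriers (not KVM code):
the writer publishes the new APIC ID before the mode/base, and the reader
loads the mode/base before the ID, so a reader that observes the new mode also
observes the matching ID.

#include <stdatomic.h>
#include <stdio.h>

#define X2APIC_ENABLE (1u << 10)	/* stand-in for the MSR bit */

static _Atomic unsigned int apic_id;	/* raw register value */
static _Atomic unsigned int apic_base;

/* Writer: store the new ID before publishing the new base ("smp_wmb"). */
static void writer_enable_x2apic(unsigned int new_id, unsigned int new_base)
{
	atomic_store_explicit(&apic_id, new_id, memory_order_relaxed);
	atomic_store_explicit(&apic_base, new_base, memory_order_release);
}

/* Reader: load the base first ("smp_rmb"), then interpret the ID. */
static unsigned int reader_apic_id(void)
{
	unsigned int base = atomic_load_explicit(&apic_base, memory_order_acquire);
	unsigned int id = atomic_load_explicit(&apic_id, memory_order_relaxed);

	return (base & X2APIC_ENABLE) ? id : id >> 24;
}

int main(void)
{
	writer_enable_x2apic(7, X2APIC_ENABLE);
	printf("observed APIC ID: %u\n", reader_apic_id());
	return 0;
}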


Re: [PATCH v2 11/13] KVM: x86: add KVM_CAP_X2APIC_API

2016-07-11 Thread Yang Zhang

On 2016/7/11 17:17, Paolo Bonzini wrote:



On 11/07/2016 10:56, Yang Zhang wrote:

On 2016/7/11 15:44, Paolo Bonzini wrote:



On 11/07/2016 08:06, Yang Zhang wrote:

Changes to MSI addresses follow the format used by interrupt remapping
unit.
The upper address word, that used to be 0, contains upper 24 bits of
the LAPIC
address in its upper 24 bits.  Lower 8 bits are reserved as 0.
Using the upper address word is not backward-compatible either as we
didn't
check that userspace zeroed the word.  Reserved bits are still not
explicitly


Does this mean we cannot migrate the VM from a KVM_CAP_X2APIC_API-enabled
host to a disabled host even if the VM doesn't have more than 255 VCPUs?


Yes, but that's why KVM_CAP_X2APIC_API is enabled manually.  The idea is
that QEMU will not use KVM_CAP_X2APIC_API except on the newest machine
type.


Thanks for the confirmation. And when will KVM_CAP_X2APIC_API be enabled
in QEMU?


It could be 2.7 or 2.8.



If interrupt remapping is on, KVM_CAP_X2APIC_API is needed even with 8
VCPUs, I think.  Otherwise KVM will believe that 0xff is "broadcast"
rather than "cluster 0, CPUs 0-7".


If interrupt remapping is in use, what 0xff means depends on which
mode the destination CPU is in. I think KVM_CAP_X2APIC_API is not
needed since the interrupt remapping table gives all the information.


If you have EIM, 0xff never means broadcast, but KVM sees a 0xff in the
interrupt route or KVM_SIGNAL_MSI argument and translates it into a
broadcast.


I see your point. I thought there would be a new IRQ routing type (like 
KVM_IRQ_ROUTING_IR) to handle all interrupts after IR is turned on, and 
that KVM_CAP_X2APIC_API would be dropped. So we will continue to use 
KVM_IRQ_ROUTING_IRQCHIP and KVM_IRQ_ROUTING_MSI for interrupts from IR, 
right?


--
best regards
yang
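
A small sketch of the MSI address layout described in the quoted changelog
(stand-in code, not KVM's): the classic 8-bit destination sits in bits 19:12
of the low address word, and with the new capability the upper 24 bits of the
destination come from the upper address word, whose low 8 bits stay zero.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* dest[7:0] from address_lo bits 19:12; dest[31:8] from address_hi bits 31:8
 * when the new API is enabled, per the changelog quoted above. */
static uint32_t msi_dest_id(uint32_t address_lo, uint32_t address_hi, bool x2apic_api)
{
	uint32_t dest = (address_lo >> 12) & 0xff;

	if (x2apic_api)
		dest |= address_hi & 0xffffff00u;
	return dest;
}

int main(void)
{
	/* Destination 0x12345: low 8 bits in address_lo, the rest in address_hi. */
	uint32_t lo = 0xfee00000u | (0x45u << 12);
	uint32_t hi = 0x12345u & 0xffffff00u;

	printf("old API: %#x, new API: %#x\n",
	       msi_dest_id(lo, hi, false), msi_dest_id(lo, hi, true));
	return 0;
}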


Re: [PATCH v2 04/13] KVM: x86: dynamic kvm_apic_map

2016-07-11 Thread Yang Zhang

On 2016/7/11 15:43, Paolo Bonzini wrote:



On 11/07/2016 08:07, Yang Zhang wrote:


 mutex_lock(&kvm->arch.apic_map_lock);

+	kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm_apic_present(vcpu))
+			max_id = max(max_id, kvm_apic_id(vcpu->arch.apic));
+
+	new = kzalloc(sizeof(struct kvm_apic_map) +
+		      sizeof(struct kvm_lapic *) * (max_id + 1), GFP_KERNEL);
+


I think this may cause the host to run out of memory if a malicious guest
does the following:
1. vcpu a is doing the apic map recalculation.
2. vcpu b writes the apic id with 0xff.
3. then vcpu b enables x2apic: in kvm_lapic_set_base(), we set
apic_base to the new value before resetting the apic id.
4. vcpu a may see x2apic enabled on vcpu b plus an old apic
id (0xff), and max_id will become (0xff << 24).


The bug is not really here but in patch 6---but you're right nevertheless!

I guess the easiest solution is to replace kvm_apic_id with a field in
struct kvm_lapic, which is already shifted right by 24 in xAPIC mode.


Or we can simply move the assignment of apic_base to the end.

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index fdc05ae..9c69059 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1745,7 +1745,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 
value)

return;
}

-   vcpu->arch.apic_base = value;

/* update jump label if enable bit changes */
if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) {
@@ -1753,7 +1752,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 
value)

static_key_slow_dec_deferred(&apic_hw_disabled);
else
static_key_slow_inc(&apic_hw_disabled.key);
-   recalculate_apic_map(vcpu->kvm);
}

if ((old_value ^ value) & X2APIC_ENABLE) {
@@ -1764,6 +1762,8 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 
value)

kvm_x86_ops->set_virtual_x2apic_mode(vcpu, false);
}

+   vcpu->arch.apic_base = value;
+   recalculate_apic_map(vcpu->kvm);
apic->base_address = apic->vcpu->arch.apic_base &
 MSR_IA32_APICBASE_BASE;


BTW, I noticed that there is no apic map recalculation after turning off 
x2apic mode. Is that correct?


--
best regards
yang
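
To put a rough number on the concern (assuming 8-byte pointers): a stale xAPIC
ID of 0xff seen through the x2APIC path makes max_id 0xff << 24, i.e. about
4.3 billion map entries.

#include <stdio.h>

int main(void)
{
	unsigned long long max_id = 0xffull << 24;	/* stale xAPIC ID seen as x2APIC */
	unsigned long long bytes  = (max_id + 1) * sizeof(void *);

	/* ~4.3e9 entries, roughly 34 GB (~32 GiB) of pointer array per recalculation. */
	printf("entries: %llu, bytes: %llu (~%llu GiB)\n",
	       max_id + 1, bytes, bytes >> 30);
	return 0;
}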


Re: [PATCH v2 11/13] KVM: x86: add KVM_CAP_X2APIC_API

2016-07-11 Thread Yang Zhang

On 2016/7/11 15:44, Paolo Bonzini wrote:



On 11/07/2016 08:06, Yang Zhang wrote:

Changes to MSI addresses follow the format used by interrupt remapping
unit.
The upper address word, that used to be 0, contains upper 24 bits of
the LAPIC
address in its upper 24 bits.  Lower 8 bits are reserved as 0.
Using the upper address word is not backward-compatible either as we
didn't
check that userspace zeroed the word.  Reserved bits are still not
explicitly


Does this mean we cannot migrate the VM from a KVM_CAP_X2APIC_API-enabled
host to a disabled host even if the VM doesn't have more than 255 VCPUs?


Yes, but that's why KVM_CAP_X2APIC_API is enabled manually.  The idea is
that QEMU will not use KVM_CAP_X2APIC_API except on the newest machine type.


Thanks for the confirmation. And when will KVM_CAP_X2APIC_API be enabled 
in QEMU?




If interrupt remapping is on, KVM_CAP_X2APIC_API is needed even with 8
VCPUs, I think.  Otherwise KVM will believe that 0xff is "broadcast"
rather than "cluster 0, CPUs 0-7".


If interrupt remapping is in use, what 0xff means depends on which 
mode the destination CPU is in. I think KVM_CAP_X2APIC_API is not 
needed since the interrupt remapping table gives all the information.


--
best regards
yang
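
A sketch of why 0xff is ambiguous (illustrative only): as a legacy xAPIC
destination it is the broadcast value, while as an x2APIC cluster-mode logical
ID it selects cluster 0 with the low 16 bits as a CPU bitmask, i.e. CPUs 0-7.

#include <stdint.h>
#include <stdio.h>

/* Interpret a destination of 0xff two ways, per the discussion above. */
int main(void)
{
	uint32_t dest = 0xff;

	/* Legacy xAPIC: 0xff is the broadcast destination. */
	printf("xAPIC: %s\n", dest == 0xff ? "broadcast" : "unicast/logical");

	/* x2APIC cluster mode: bits 31:16 = cluster, bits 15:0 = CPU bitmask. */
	uint32_t cluster = dest >> 16;
	uint32_t mask = dest & 0xffff;
	printf("x2APIC logical: cluster %u, CPU mask %#x (CPUs 0-7)\n", cluster, mask);
	return 0;
}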

