Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-17 Thread Vitaly Kuznetsov
Boris Ostrovsky  writes:

> On 08/16/2017 12:42 PM, Vitaly Kuznetsov wrote:
>> Vitaly Kuznetsov  writes:
>>
>>> In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
>>> kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
>>> do you think about the required testing? Any suggestion for a
>>> specifically crafted micro benchmark in addition to standard
>>> ebizzy/kernbench/...?
>> In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (enablement
>> patch I used is attached; I know that it breaks other architectures) on
>> bare metal with PARAVIRT enabled in config. The results are:
>>
>>...
>>
>> As you can see, there's no notable difference. I'll think of a
>> microbenchmark though.
>
> FWIW, I was about to send a very similar patch (but with only Xen-PV
> enabling RCU-based free by default) and saw similar results with
> kernbench, both Xen PV and baremetal.
>

Thanks for the confirmation,

I'd go with enabling it for PARAVIRT as we will need it for Hyper-V too.



>>  
>>  #if CONFIG_PGTABLE_LEVELS > 4
>>  void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
>>  {
>>  paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
>> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
>> +tlb_remove_table(tlb, virt_to_page(p4d));
>> +#else
>>  tlb_remove_page(tlb, virt_to_page(p4d));
>> +#endif
>
> This can probably be factored out.
>
>>  }
>>  #endif  /* CONFIG_PGTABLE_LEVELS > 4 */
>>  #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e158f7ac6730..18d6671b6ae2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -329,6 +329,11 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, 
>> struct page *page, int page_
>>   * See the comment near struct mmu_table_batch.
>>   */
>>  
>> +static void __tlb_remove_table(void *table)
>> +{
>> +free_page_and_swap_cache(table);
>> +}
>> +
>
> This needs to be a per-arch routine (e.g. see arch/arm64/include/asm/tlb.h).
>

Yea, this was a quick-and-dirty x86-only patch.
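[ For illustration, the factoring Boris suggests could look roughly like the
  sketch below. It is untested, the helper name pgtable_free_tlb() is made up
  for this example, and the __tlb_remove_table() hook would live in an arch
  header such as arch/x86/include/asm/tlb.h rather than in mm/memory.c. ]

/* One helper shared by the ___pte/___pmd/___pud/___p4d_free_tlb() callers. */
static inline void pgtable_free_tlb(struct mmu_gather *tlb, struct page *page)
{
#ifdef CONFIG_HAVE_RCU_TABLE_FREE
	/* Queue the page-table page; it is freed after an RCU grace period. */
	tlb_remove_table(tlb, page);
#else
	/* The IPI-based flush already serializes against software walkers. */
	tlb_remove_page(tlb, page);
#endif
}

void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
{
	paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
	pgtable_free_tlb(tlb, virt_to_page(p4d));
}

/* Arch-provided hook consumed by the generic RCU batching code. */
static inline void __tlb_remove_table(void *table)
{
	free_page_and_swap_cache(table);
}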

-- 
  Vitaly


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-16 Thread Boris Ostrovsky
On 08/16/2017 12:42 PM, Vitaly Kuznetsov wrote:
> Vitaly Kuznetsov  writes:
>
>> Peter Zijlstra  writes:
>>
>>> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
 On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra  
 wrote:
> I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
> would make it work again), but this was some years ago and I cannot
> readily find those emails.
 I think the only time we really talked about HAVE_RCU_TABLE_FREE for
 x86 (at least that I was cc'd on) was not because of RCU freeing, but
 because we just wanted to use the generic page table lookup code on
 x86 *despite* not using RCU freeing.

 And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.

 There was only passing mention of maybe making x86 use RCU, but the
 discussion was really about why the IF flag meant that x86 didn't need
 to, iirc.

 I don't recall us ever discussing *really* making x86 use RCU.
>>> Google finds me this:
>>>
>>>   https://lwn.net/Articles/500188/
>>>
>>> Which includes:
>>>
>>>   http://www.mail-archive.com/kvm@vger.kernel.org/msg72918.html
>>>
>>> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
>>> PARAVIRT_TLB_FLUSH.
>>>
>>> But yes, this is very much virt specific nonsense, native would never
>>> need this.
>> In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
>> kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
>> do you think about the required testing? Any suggestion for a
>> specifically crafted micro benchmark in addition to standard
>> ebizzy/kernbench/...?
> In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (enablement
> patch I used is attached; I know that it breaks other architectures) on
> bare metal with PARAVIRT enabled in config. The results are:
>
> 6-CPU host:
>
> Average Half load -j 3 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 400.498 (0.179679)        Elapsed Time 399.909 (0.162853)
> User Time 1098.72 (0.278536)           User Time 1097.59 (0.283894)
> System Time 100.301 (0.201629)         System Time 99.736 (0.196254)
> Percent CPU 299 (0)                    Percent CPU 299 (0)
> Context Switches 5774.1 (69.2121)      Context Switches 5744.4 (79.4162)
> Sleeps 87621.2 (78.1093)               Sleeps 87586.1 (99.7079)
>
> Average Optimal load -j 24 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 219.03 (0.652534)         Elapsed Time 218.959 (0.598674)
> User Time 1119.51 (21.3284)            User Time 1118.81 (21.7793)
> System Time 100.499 (0.389308)         System Time 99.8335 (0.251423)
> Percent CPU 432.5 (136.974)            Percent CPU 432.45 (136.922)
> Context Switches 81827.4 (78029.5)     Context Switches 81818.5 (78051)
> Sleeps 97124.8 (9822.4)                Sleeps 97207.9 (9955.04)
>
> 16-CPU host:
>
> Average Half load -j 8 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 213.538 (3.7891)          Elapsed Time 212.5 (3.10939)
> User Time 1306.4 (1.83399)             User Time 1307.65 (1.01364)
> System Time 194.59 (0.864378)          System Time 195.478 (0.794588)
> Percent CPU 702.6 (13.5388)            Percent CPU 707 (11.1131)
> Context Switches 21189.2 (1199.4)      Context Switches 21288.2 (552.388)
> Sleeps 89390.2 (482.325)               Sleeps 89677 (277.06)
>
> Average Optimal load -j 64 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 137.866 (0.787928)        Elapsed Time 138.438 (0.218792)
> User Time 1488.92 (192.399)            User Time 1489.92 (192.135)
> System Time 234.981 (42.5806)          System Time 236.09 (42.8138)
> Percent CPU 1057.1 (373.826)           Percent CPU 1057.1 (369.114)
> Context Switches 187514 (175324)       Context Switches 187358 (175060)
> Sleeps 112633 (24535.5)                Sleeps 111743 (23297.6)
>
> As you can see, there's no notable difference. I'll think of a
> microbenchmark though.

FWIW, I was about to send a very similar patch (but with only Xen-PV
enabling RCU-based free by default) and saw similar results with
kernbench, both Xen PV and baremetal.

>> Additionally, I see another option for us: enable 'rcu table free' on
>> boot (e.g. by taking tlb_remove_table to pv_ops and doing boot-time
>> patching for it) so bare metal and other hypervisors are not affected
>> by the change.
> It seems there's no need for that and we can keep things simple...
>
> -- Vitaly
>
> 0001-x86-enable-RCU-based-table-free-w

Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-16 Thread Vitaly Kuznetsov
Vitaly Kuznetsov  writes:

> Peter Zijlstra  writes:
>
>> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
>>> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra  
>>> wrote:
>>> >
>>> > I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
>>> > would make it work again), but this was some years ago and I cannot
>>> > readily find those emails.
>>> 
>>> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
>>> x86 (at least that I was cc'd on) was not because of RCU freeing, but
>>> because we just wanted to use the generic page table lookup code on
>>> x86 *despite* not using RCU freeing.
>>> 
>>> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
>>> 
>>> There was only passing mention of maybe making x86 use RCU, but the
>>> discussion was really about why the IF flag meant that x86 didn't need
>>> to, iirc.
>>> 
>>> I don't recall us ever discussing *really* making x86 use RCU.
>>
>> Google finds me this:
>>
>>   https://lwn.net/Articles/500188/
>>
>> Which includes:
>>
>>   http://www.mail-archive.com/kvm@vger.kernel.org/msg72918.html
>>
>> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
>> PARAVIRT_TLB_FLUSH.
>>
>> But yes, this is very much virt specific nonsense, native would never
>> need this.
>
> In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
> kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
> do you think about the required testing? Any suggestion for a
> specifically crafted micro benchmark in addition to standard
> ebizzy/kernbench/...?

In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (enablement
patch I used is attached; I know that it breaks other architectures) on
bare metal with PARAVIRT enabled in config. The results are:

6-CPU host:

Average Half load -j 3 Run (std deviation):
CURRENT                                HAVE_RCU_TABLE_FREE
===                                    ===
Elapsed Time 400.498 (0.179679)        Elapsed Time 399.909 (0.162853)
User Time 1098.72 (0.278536)           User Time 1097.59 (0.283894)
System Time 100.301 (0.201629)         System Time 99.736 (0.196254)
Percent CPU 299 (0)                    Percent CPU 299 (0)
Context Switches 5774.1 (69.2121)      Context Switches 5744.4 (79.4162)
Sleeps 87621.2 (78.1093)               Sleeps 87586.1 (99.7079)

Average Optimal load -j 24 Run (std deviation):
CURRENT                                HAVE_RCU_TABLE_FREE
===                                    ===
Elapsed Time 219.03 (0.652534)         Elapsed Time 218.959 (0.598674)
User Time 1119.51 (21.3284)            User Time 1118.81 (21.7793)
System Time 100.499 (0.389308)         System Time 99.8335 (0.251423)
Percent CPU 432.5 (136.974)            Percent CPU 432.45 (136.922)
Context Switches 81827.4 (78029.5)     Context Switches 81818.5 (78051)
Sleeps 97124.8 (9822.4)                Sleeps 97207.9 (9955.04)

16-CPU host:

Average Half load -j 8 Run (std deviation):
CURRENT                                HAVE_RCU_TABLE_FREE
===                                    ===
Elapsed Time 213.538 (3.7891)          Elapsed Time 212.5 (3.10939)
User Time 1306.4 (1.83399)             User Time 1307.65 (1.01364)
System Time 194.59 (0.864378)          System Time 195.478 (0.794588)
Percent CPU 702.6 (13.5388)            Percent CPU 707 (11.1131)
Context Switches 21189.2 (1199.4)      Context Switches 21288.2 (552.388)
Sleeps 89390.2 (482.325)               Sleeps 89677 (277.06)

Average Optimal load -j 64 Run (std deviation):
CURRENT                                HAVE_RCU_TABLE_FREE
===                                    ===
Elapsed Time 137.866 (0.787928)        Elapsed Time 138.438 (0.218792)
User Time 1488.92 (192.399)            User Time 1489.92 (192.135)
System Time 234.981 (42.5806)          System Time 236.09 (42.8138)
Percent CPU 1057.1 (373.826)           Percent CPU 1057.1 (369.114)
Context Switches 187514 (175324)       Context Switches 187358 (175060)
Sleeps 112633 (24535.5)                Sleeps 111743 (23297.6)

As you can see, there's no notable difference. I'll think of a
microbenchmark though.

>
> Additionally, I see another option for us: enable 'rcu table free' on
> boot (e.g. by taking tlb_remove_table to pv_ops and doing boot-time
> patching for it) so bare metal and other hypervisors are not affected
> by the change.

It seems there's no need for that and we can keep things simple...

-- 
  Vitaly

From daf5117706920aebe793d1239fccac2edd4d680c Mon Sep 17 00:00:00 2001
From: Vitaly Kuznetsov 
Date: Mon, 14 Aug 2017 16:05:05 +0200
Subject: [PATCH] x86: enable RCU based table free when PARAVIRT

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/Kconfig  |  1 +
 arch/x86/mm/pgtable.c | 16 
 mm/memory.c   |  5 +
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
in

Re: [Xen-devel] [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-15 Thread Steven Rostedt
On Fri, 11 Aug 2017 14:07:14 +0200
Peter Zijlstra  wrote:

> It goes like:
> 
>   CPU0CPU1
> 
>   unhook page
>   cli
>   traverse page tables
>   TLB invalidate ---> 
>   sti
>   
>TLB invalidate
>   <--  complete

I guess the important part here is the above "complete". CPU0 doesn't
proceed until it receives it. Thus it does act like
cli~rcu_read_lock(), sti~rcu_read_unlock(), and "TLB invalidate" is
equivalent to synchronize_rcu().

[ this response is for clarification for the casual observer of this
  thread ;-) ]

-- Steve

>   
>   free page
> 
> So the CPU1 page-table walker gets an existence guarantee of the
> page-tables by clearing IF.
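[ For the casual observer again: the HAVE_RCU_TABLE_FREE scheme being
  discussed gives the same guarantee without depending on the IPI.
  Page-table pages are queued and only freed after an RCU(-sched) grace
  period, which cannot expire while any CPU is running with interrupts
  disabled. A heavily simplified sketch, modeled loosely on the
  CONFIG_HAVE_RCU_TABLE_FREE code in mm/memory.c of that time (allocation
  failure and fallback paths omitted): ]

struct mmu_table_batch {
	struct rcu_head	rcu;
	unsigned int	nr;
	void		*tables[];
};

static void tlb_remove_table_rcu(struct rcu_head *head)
{
	struct mmu_table_batch *batch;
	int i;

	batch = container_of(head, struct mmu_table_batch, rcu);

	/* The grace period has elapsed: no IRQs-off walker can still see these. */
	for (i = 0; i < batch->nr; i++)
		__tlb_remove_table(batch->tables[i]);

	free_page((unsigned long)batch);
}

void tlb_table_flush(struct mmu_gather *tlb)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch) {
		/* Defer the actual free until after an RCU-sched grace period. */
		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
		*batch = NULL;
	}
}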



Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-14 Thread Vitaly Kuznetsov
Peter Zijlstra  writes:

> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
>> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra  wrote:
>> >
>> > I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
>> > would make it work again), but this was some years ago and I cannot
>> > readily find those emails.
>> 
>> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
>> x86 (at least that I was cc'd on) was not because of RCU freeing, but
>> because we just wanted to use the generic page table lookup code on
>> x86 *despite* not using RCU freeing.
>> 
>> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
>> 
>> There was only passing mention of maybe making x86 use RCU, but the
>> discussion was really about why the IF flag meant that x86 didn't need
>> to, iirc.
>> 
>> I don't recall us ever discussing *really* making x86 use RCU.
>
> Google finds me this:
>
>   https://lwn.net/Articles/500188/
>
> Which includes:
>
>   http://www.mail-archive.com/kvm@vger.kernel.org/msg72918.html
>
> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
> PARAVIRT_TLB_FLUSH.
>
> But yes, this is very much virt specific nonsense, native would never
> need this.

In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
do you think about the required testing? Any suggestion for a
specifically crafted micro benchmark in addition to standard
ebizzy/kernbench/...?

Additionally, I see another option for us: enable 'rcu table free' on
boot (e.g. by taking tlb_remove_table to pv_ops and doing boot-time
patching for it) so bare metal and other hypervisors are not affected
by the change.
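[ A sketch of what such a pv_ops hook could look like -- purely illustrative,
  the hook and the function names below are invented for this example and did
  not exist at the time: ]

/* arch/x86/include/asm/paravirt_types.h (illustrative fragment) */
struct pv_mmu_ops {
	/* ... existing members ... */
	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
};

/* Bare metal: the IPI-based flush already serializes, so free immediately. */
static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	tlb_remove_page(tlb, table);
}

/* Xen PV / Hyper-V would point the hook at the RCU-batched path instead. */
static void xen_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	tlb_remove_table(tlb, table);
}

/* Callers in arch/x86/mm/pgtable.c would then go through the hook: */
void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
	pgtable_page_dtor(pte);
	paravirt_release_pte(page_to_pfn(pte));
	pv_mmu_ops.tlb_remove_table(tlb, pte);
}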

-- 
  Vitaly


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Peter Zijlstra
On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra  wrote:
> >
> > I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
> > would make it work again), but this was some years ago and I cannot
> > readily find those emails.
> 
> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
> x86 (at least that I was cc'd on) was not because of RCU freeing, but
> because we just wanted to use the generic page table lookup code on
> x86 *despite* not using RCU freeing.
> 
> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
> 
> There was only passing mention of maybe making x86 use RCU, but the
> discussion was really about why the IF flag meant that x86 didn't need
> to, iirc.
> 
> I don't recall us ever discussing *really* making x86 use RCU.

Google finds me this:

  https://lwn.net/Articles/500188/

Which includes:

  http://www.mail-archive.com/kvm@vger.kernel.org/msg72918.html

which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
PARAVIRT_TLB_FLUSH.

But yes, this is very much virt specific nonsense, native would never
need this.


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Linus Torvalds
On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra  wrote:
>
> I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
> would make it work again), but this was some years ago and I cannot
> readily find those emails.

I think the only time we really talked about HAVE_RCU_TABLE_FREE for
x86 (at least that I was cc'd on) was not because of RCU freeing, but
because we just wanted to use the generic page table lookup code on
x86 *despite* not using RCU freeing.

And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.

There was only passing mention of maybe making x86 use RCU, but the
discussion was really about why the IF flag meant that x86 didn't need
to, iirc.

I don't recall us ever discussing *really* making x86 use RCU.

Linus


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Peter Zijlstra
On Fri, Aug 11, 2017 at 03:07:29PM +0200, Juergen Gross wrote:
> On 11/08/17 14:54, Peter Zijlstra wrote:
> > On Fri, Aug 11, 2017 at 02:46:41PM +0200, Juergen Gross wrote:
> >> Aah, okay. Now I understand the problem. The TLB isn't the issue but the
> >> IPI is serving two purposes here: TLB flushing (which is allowed to
> >> happen at any time) and serialization regarding access to critical pages
> >> (which seems to be broken in the Xen case as you suggest).
> > 
> > Indeed, and now hyper-v as well.
> 
> Is it possible to distinguish between non-critical calls of
> flush_tlb_others() (which should be the majority IMHO) and critical ones
> regarding above problem? I guess the only problem is the case when a
> page table can be freed because its last valid entry is gone, right?
> 
> We might want to add a serialization flag to indicate flushing _and_
> serialization via IPI should be performed.

Possible, but not trivial. Especially things like transparent huge pages,
which swizzle PMDs around, make things tricky.

By far the easiest solution is to switch over to HAVE_RCU_TABLE_FREE
when either Xen or Hyper-V is doing this. Ideally it would not have a
significant performance hit (needs testing) and we can simply always do
this when PARAVIRT, or otherwise we need to get creative with
static_keys or something.
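[ The "simply always do this when PARAVIRT" variant amounts to a one-line
  Kconfig change plus the tlb_remove_table()/__tlb_remove_table() plumbing
  discussed elsewhere in this thread. Illustrative fragment only; the rest of
  the config X86 entry is omitted: ]

# arch/x86/Kconfig
config X86
	def_bool y
	select HAVE_RCU_TABLE_FREE		if PARAVIRT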


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Juergen Gross
On 11/08/17 14:54, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 02:46:41PM +0200, Juergen Gross wrote:
>> Aah, okay. Now I understand the problem. The TLB isn't the issue but the
>> IPI is serving two purposes here: TLB flushing (which is allowed to
>> happen at any time) and serialization regarding access to critical pages
>> (which seems to be broken in the Xen case as you suggest).
> 
> Indeed, and now hyper-v as well.

Is it possible to distinguish between non-critical calls of
flush_tlb_others() (which should be the majority IMHO) and critical ones
regarding above problem? I guess the only problem is the case when a
page table can be freed because its last valid entry is gone, right?

We might want to add a serialization flag to indicate flushing _and_
serialization via IPI should be performed.


Juergen


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Peter Zijlstra
On Fri, Aug 11, 2017 at 02:46:41PM +0200, Juergen Gross wrote:
> Aah, okay. Now I understand the problem. The TLB isn't the issue but the
> IPI is serving two purposes here: TLB flushing (which is allowed to
> happen at any time) and serialization regarding access to critical pages
> (which seems to be broken in the Xen case as you suggest).

Indeed, and now hyper-v as well.


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Juergen Gross
On 11/08/17 14:35, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 02:22:25PM +0200, Juergen Gross wrote:
>> Wait - the TLB can be cleared at any time, as Andrew was pointing out.
>> No cpu can rely on an address being accessible just because IF is being
>> cleared. All that matters is the existing and valid page table entry.
>>
>> So clearing IF on a cpu isn't meant to secure the TLB from being
>> cleared, but just to avoid interrupts (as the name of the flag is
>> suggesting).
> 
> Yes, but by holding off the TLB invalidate IPI, we hold off the freeing
> of the concurrently unhooked page-table.
> 
>> In the Xen case the hypervisor does the following:
>>
>> - it checks whether any of the vcpus specified in the cpumask of the
>>   flush request is running on any physical cpu
>> - if any running vcpu is found an IPI will be sent to the physical cpu
>>   and the hypervisor will do the TLB flush there
> 
> And this will preempt a vcpu which could have IF cleared, right?
> 
>> - any vcpu addressed by the flush and not running will be flagged to
>>   flush its TLB when being scheduled the next time
>>
>> This ensures no TLB entry to be flushed can be used after return of
>> xen_flush_tlb_others().
> 
> But that is not a sufficient guarantee. We need the IF to hold off the
> TLB invalidate and thereby hold off the freeing of our page-table pages.

Aah, okay. Now I understand the problem. The TLB isn't the issue but the
IPI is serving two purposes here: TLB flushing (which is allowed to
happen at any time) and serialization regarding access to critical pages
(which seems to be broken in the Xen case as you suggest).

Juergen

> 



Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Peter Zijlstra
On Fri, Aug 11, 2017 at 02:22:25PM +0200, Juergen Gross wrote:
> Wait - the TLB can be cleared at any time, as Andrew was pointing out.
> No cpu can rely on an address being accessible just because IF is being
> cleared. All that matters is the existing and valid page table entry.
> 
> So clearing IF on a cpu isn't meant to secure the TLB from being
> cleared, but just to avoid interrupts (as the name of the flag is
> suggesting).

Yes, but by holding off the TLB invalidate IPI, we hold off the freeing
of the concurrently unhooked page-table.

> In the Xen case the hypervisor does the following:
> 
> - it checks whether any of the vcpus specified in the cpumask of the
>   flush request is running on any physical cpu
> - if any running vcpu is found an IPI will be sent to the physical cpu
>   and the hypervisor will do the TLB flush there

And this will preempt a vcpu which could have IF cleared, right?

> - any vcpu addressed by the flush and not running will be flagged to
>   flush its TLB when being scheduled the next time
> 
> This ensures no TLB entry to be flushed can be used after return of
> xen_flush_tlb_others().

But that is not a sufficient guarantee. We need the IF to hold off the
TLB invalidate and thereby hold off the freeing of our page-table pages.


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Juergen Gross
On 11/08/17 12:56, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 11:23:10AM +0200, Vitaly Kuznetsov wrote:
>> Peter Zijlstra  writes:
>>
>>> On Thu, Aug 10, 2017 at 07:08:22PM +, Jork Loeser wrote:
>>>
>>>>>> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote 
>>>>>> TLB flush
>>>>
>>>>>> Hold on.. if we don't IPI for TLB invalidation. What serializes our
>>>>>> software page table walkers like fast_gup() ?
>>>>>
>>>>> Hypervisor may implement this functionality via an IPI.
>>>>>
>>>>> K. Y
>>>>
>>>> HvFlushVirtualAddressList() states:
>>>> This call guarantees that by the time control returns back to the
>>>> caller, the observable effects of all flushes on the specified virtual
>>>> processors have occurred.
>>>>
>>>> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as 
>>>> adding sparse target VP lists.
>>>>
>>>> Is this enough of a guarantee, or do you see other races?
>>>
>>> That's nowhere near enough. We need the remote CPU to have completed any
>>> guest IF section that was in progress at the time of the call.
>>>
>>> So if a host IPI can interrupt a guest while the guest has IF cleared,
>>> and we then process the host IPI -- clear the TLBs -- before resuming the
>>> guest, which still has IF cleared, we've got a problem.
>>>
>>> Because at that point, our software page-table walker, that relies on IF
>>> being clear to guarantee the page-tables exist, because it holds off the
>>> TLB invalidate and thereby the freeing of the pages, gets its pages
>>> ripped out from under it.
>>
>> Oh, I see your concern. Hyper-V, however, is not the first x86
>> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
>> too. Briefly looking at xen_flush_tlb_others() I don't see anything
>> special, do we know how serialization is achieved there?
> 
> No idea on how Xen works, I always just hope it goes away :-) But let's
> ask some Xen folks.

Wait - the TLB can be cleared at any time, as Andrew was pointing out.
No cpu can rely on an address being accessible just because IF is being
cleared. All that matters is the existing and valid page table entry.

So clearing IF on a cpu isn't meant to secure the TLB from being
cleared, but just to avoid interrupts (as the name of the flag is
suggesting).

In the Xen case the hypervisor does the following:

- it checks whether any of the vcpus specified in the cpumask of the
  flush request is running on any physical cpu
- if any running vcpu is found an IPI will be sent to the physical cpu
  and the hypervisor will do the TLB flush there
- any vcpu addressed by the flush and not running will be flagged to
  flush its TLB when being scheduled the next time

This ensures no TLB entry to be flushed can be used after return of
xen_flush_tlb_others().


Juergen
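[ Paraphrasing the steps above as pseudocode -- a reader's sketch of the
  hypervisor-side logic Juergen describes, not actual Xen source; all names
  are invented: ]

/* Hypervisor handling of a guest's flush_tlb_others() request (pseudocode). */
void handle_guest_remote_flush(vcpu_mask_t mask)
{
	struct vcpu *v;

	for_each_vcpu_in(mask, v) {
		if (vcpu_is_running(v)) {
			/* Flush now, via a hypervisor IPI to the physical CPU. */
			flush_tlb_on_pcpu(v->pcpu);
		} else {
			/* Flush lazily, when the vcpu is next scheduled. */
			v->flush_tlb_on_next_schedule = true;
		}
	}
	/* Return to the guest only once all running vcpus have flushed. */
}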


Re: [Xen-devel] [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Peter Zijlstra
On Fri, Aug 11, 2017 at 12:05:45PM +0100, Andrew Cooper wrote:
> >> Oh, I see your concern. Hyper-V, however, is not the first x86
> >> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
> >> too. Briefly looking at xen_flush_tlb_others() I don't see anything
> >> special, do we know how serialization is achieved there?
> > No idea on how Xen works, I always just hope it goes away :-) But lets
> > ask some Xen folks.
> 
> How is it safe at all for the software pagewalker to rely on IF being
> clear (on native, let alone under virtualisation)?  Hardware has no
> architectural requirement to keep entries in the TLB.

No, but it _can_, therefore when we unhook pages we _must_ invalidate.

It goes like:

CPU0CPU1

unhook page
cli
traverse page tables
TLB invalidate ---> 
sti

 TLB invalidate
<--  complete

free page

So the CPU1 page-table walker gets an existence guarantee of the
page-tables by clearing IF.

> In the virtualisation case, at any point the vcpu can be scheduled on a
> different pcpu even during a critical region like that, so the TLB
> really can empty itself under your feet.

Not the point.
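[ For reference, this is the walker side of the diagram: the fast-GUP path
  disables interrupts around the walk so the invalidate IPI -- and therefore
  the free -- cannot complete in the middle. A simplified sketch, loosely
  after the generic fast-GUP code in mm/gup.c of that time (the access_ok()
  and related checks are omitted): ]

int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
			  struct page **pages)
{
	unsigned long len, end, flags;
	int nr = 0;

	start &= PAGE_MASK;
	len = (unsigned long)nr_pages << PAGE_SHIFT;
	end = start + len;

	/*
	 * Disabling interrupts blocks the TLB-invalidate IPI, which on native
	 * x86 is what holds off the freeing of the page-table pages we walk.
	 */
	local_irq_save(flags);
	gup_pgd_range(start, end, write, pages, &nr);
	local_irq_restore(flags);

	return nr;
}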



Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Kirill A. Shutemov
On Fri, Aug 11, 2017 at 11:03:36AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 01:15:18AM +, Jork Loeser wrote:
> 
> > > > HvFlushVirtualAddressList() states:
> > > > This call guarantees that by the time control returns back to the
> > > > caller, the observable effects of all flushes on the specified virtual
> > > > processors have occurred.
> > > >
> > > > HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as 
> > > > adding
> > > > sparse target VP lists.
> > > >
> > > > Is this enough of a guarantee, or do you see other races?
> > > 
> > > That's nowhere near enough. We need the remote CPU to have completed any
> > > guest IF section that was in progress at the time of the call.
> > > 
> > > So if a host IPI can interrupt a guest while the guest has IF cleared, 
> > > and we then
> > > process the host IPI -- clear the TLBs -- before resuming the guest, 
> > > which still has
> > > IF cleared, we've got a problem.
> > > 
> > > Because at that point, our software page-table walker, that relies on IF 
> > > being
> > > clear to guarantee the page-tables exist, because it holds off the TLB 
> > > invalidate
> > > and thereby the freeing of the pages, gets its pages ripped out from 
> > > under it.
> > 
> > I see, IF is used as a locking mechanism for the pages. Would
> > CONFIG_HAVE_RCU_TABLE_FREE be an option for x86? There are caveats
> > (statically enabled, RCU for page-free), yet if the resulting perf is
> > still a gain it would be worthwhile for Hyper-V targeted kernels.
> 
> I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
> would make it work again), but this was some years ago and I cannot
> readily find those emails.
> 
> Kirill would you have any opinions?

I guess we can try this. The main question is what the performance
implications of such a move would be.

-- 
 Kirill A. Shutemov


Re: [Xen-devel] [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Andrew Cooper
On 11/08/17 11:56, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 11:23:10AM +0200, Vitaly Kuznetsov wrote:
>> Peter Zijlstra  writes:
>>
>>> On Thu, Aug 10, 2017 at 07:08:22PM +, Jork Loeser wrote:
>>>
>>>>>> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote 
>>>>>> TLB flush
>>>>>> Hold on.. if we don't IPI for TLB invalidation. What serializes our
>>>>>> software page table walkers like fast_gup() ?
>>>>> Hypervisor may implement this functionality via an IPI.
>>>>>
>>>>> K. Y
>>>> HvFlushVirtualAddressList() states:
>>>> This call guarantees that by the time control returns back to the
>>>> caller, the observable effects of all flushes on the specified virtual
>>>> processors have occurred.
>>>>
>>>> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as 
>>>> adding sparse target VP lists.
>>>>
>>>> Is this enough of a guarantee, or do you see other races?
>>> That's nowhere near enough. We need the remote CPU to have completed any
>>> guest IF section that was in progress at the time of the call.
>>>
>>> So if a host IPI can interrupt a guest while the guest has IF cleared,
>>> and we then process the host IPI -- clear the TLBs -- before resuming the
>>> guest, which still has IF cleared, we've got a problem.
>>>
>>> Because at that point, our software page-table walker, that relies on IF
>>> being clear to guarantee the page-tables exist, because it holds off the
>>> TLB invalidate and thereby the freeing of the pages, gets its pages
>>> ripped out from under it.
>> Oh, I see your concern. Hyper-V, however, is not the first x86
>> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
>> too. Briefly looking at xen_flush_tlb_others() I don't see anything
>> special, do we know how serialization is achieved there?
> No idea on how Xen works, I always just hope it goes away :-) But let's
> ask some Xen folks.

How is it safe at all for the software pagewalker to rely on IF being clear
(on native, let alone under virtualisation)?  Hardware has no architectural
requirement to keep entries in the TLB.

In the virtualisation case, at any point the vcpu can be scheduled on a
different pcpu even during a critical region like that, so the TLB
really can empty itself under your feet.

~Andrew


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Peter Zijlstra
On Fri, Aug 11, 2017 at 11:23:10AM +0200, Vitaly Kuznetsov wrote:
> Peter Zijlstra  writes:
> 
> > On Thu, Aug 10, 2017 at 07:08:22PM +, Jork Loeser wrote:
> >
> >> > > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote 
> >> > > TLB flush
> >> 
> >> > > Hold on.. if we don't IPI for TLB invalidation. What serializes our
> >> > > software page table walkers like fast_gup() ?
> >> > 
> >> > Hypervisor may implement this functionality via an IPI.
> >> > 
> >> > K. Y
> >> 
> >> HvFlushVirtualAddressList() states:
> >> This call guarantees that by the time control returns back to the
> >> caller, the observable effects of all flushes on the specified virtual
> >> processors have occurred.
> >> 
> >> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as 
> >> adding sparse target VP lists.
> >> 
> >> Is this enough of a guarantee, or do you see other races?
> >
> > That's nowhere near enough. We need the remote CPU to have completed any
> > guest IF section that was in progress at the time of the call.
> >
> > So if a host IPI can interrupt a guest while the guest has IF cleared,
> > and we then process the host IPI -- clear the TLBs -- before resuming the
> > guest, which still has IF cleared, we've got a problem.
> >
> > Because at that point, our software page-table walker, that relies on IF
> > being clear to guarantee the page-tables exist, because it holds off the
> > TLB invalidate and thereby the freeing of the pages, gets its pages
> > ripped out from under it.
> 
> Oh, I see your concern. Hyper-V, however, is not the first x86
> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
> too. Briefly looking at xen_flush_tlb_others() I don't see anything
> special, do we know how serialization is achieved there?

No idea on how Xen works, I always just hope it goes away :-) But let's
ask some Xen folks.



Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Vitaly Kuznetsov
Peter Zijlstra  writes:

> On Thu, Aug 10, 2017 at 07:08:22PM +, Jork Loeser wrote:
>
>> > > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote 
>> > > TLB flush
>> 
>> > > Hold on.. if we don't IPI for TLB invalidation. What serializes our
>> > > software page table walkers like fast_gup() ?
>> > 
>> > Hypervisor may implement this functionality via an IPI.
>> > 
>> > K. Y
>> 
>> HvFlushVirtualAddressList() states:
>> This call guarantees that by the time control returns back to the
>> caller, the observable effects of all flushes on the specified virtual
>> processors have occurred.
>> 
>> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as 
>> adding sparse target VP lists.
>> 
>> Is this enough of a guarantee, or do you see other races?
>
> That's nowhere near enough. We need the remote CPU to have completed any
> guest IF section that was in progress at the time of the call.
>
> So if a host IPI can interrupt a guest while the guest has IF cleared,
> and we then process the host IPI -- clear the TLBs -- before resuming the
> guest, which still has IF cleared, we've got a problem.
>
> Because at that point, our software page-table walker, that relies on IF
> being clear to guarantee the page-tables exist, because it holds off the
> TLB invalidate and thereby the freeing of the pages, gets its pages
> ripped out from under it.

Oh, I see your concern. Hyper-V, however, is not the first x86
hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
too. Briefly looking at xen_flush_tlb_others() I don't see anything
special, do we know how serialization is achieved there?

-- 
  Vitaly


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-11 Thread Peter Zijlstra
On Fri, Aug 11, 2017 at 01:15:18AM +, Jork Loeser wrote:

> > > HvFlushVirtualAddressList() states:
> > > This call guarantees that by the time control returns back to the
> > > caller, the observable effects of all flushes on the specified virtual
> > > processors have occurred.
> > >
> > > HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as 
> > > adding
> > > sparse target VP lists.
> > >
> > > Is this enough of a guarantee, or do you see other races?
> > 
> > That's nowhere near enough. We need the remote CPU to have completed any
> > guest IF section that was in progress at the time of the call.
> > 
> > So if a host IPI can interrupt a guest while the guest has IF cleared, and 
> > we then
> > process the host IPI -- clear the TLBs -- before resuming the guest, which 
> > still has
> > IF cleared, we've got a problem.
> > 
> > Because at that point, our software page-table walker, that relies on IF 
> > being
> > clear to guarantee the page-tables exist, because it holds off the TLB 
> > invalidate
> > and thereby the freeing of the pages, gets its pages ripped out from under 
> > it.
> 
> I see, IF is used as a locking mechanism for the pages. Would
> CONFIG_HAVE_RCU_TABLE_FREE be an option for x86? There are caveats
> (statically enabled, RCU for page-free), yet if the resulting perf is
> still a gain it would be worthwhile for Hyper-V targeted kernels.

I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
would make it work again), but this was some years ago and I cannot
readily find those emails.

Kirill would you have any opinions?


RE: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-10 Thread Jork Loeser
> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Thursday, August 10, 2017 12:28
> To: Jork Loeser 
> Cc: KY Srinivasan ; Simon Xiao ;
> Haiyang Zhang ; Stephen Hemminger
> ; torva...@linux-foundation.org; l...@kernel.org;
> h...@zytor.com; vkuzn...@redhat.com; linux-kernel@vger.kernel.org;
> rost...@goodmis.org; andy.shevche...@gmail.com; t...@linutronix.de;
> mi...@kernel.org; linux-tip-comm...@vger.kernel.org
> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB 
> flush

> > > > Hold on.. if we don't IPI for TLB invalidation. What serializes
> > > > our software page table walkers like fast_gup() ?
> > >
> > > Hypervisor may implement this functionality via an IPI.
> > >
> > > K. Y
> >
> > HvFlushVirtualAddressList() states:
> > This call guarantees that by the time control returns back to the
> > caller, the observable effects of all flushes on the specified virtual
> > processors have occurred.
> >
> > HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as 
> > adding
> sparse target VP lists.
> >
> > Is this enough of a guarantee, or do you see other races?
> 
> That's nowhere near enough. We need the remote CPU to have completed any
> guest IF section that was in progress at the time of the call.
> 
> So if a host IPI can interrupt a guest while the guest has IF cleared, and we 
> then
> process the host IPI -- clear the TLBs -- before resuming the guest, which 
> still has
> IF cleared, we've got a problem.
> 
> Because at that point, our software page-table walker, that relies on IF being
> clear to guarantee the page-tables exist, because it holds off the TLB 
> invalidate
> and thereby the freeing of the pages, gets its pages ripped out from under it.

I see, IF is used as a locking mechanism for the pages. Would 
CONFIG_HAVE_RCU_TABLE_FREE be an option for x86? There are caveats (statically 
enabled, RCU for page-free), yet if the resulting perf is still a gain it would 
be worthwhile for Hyper-V targeted kernels.

Regards,
Jork


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-10 Thread Peter Zijlstra
On Thu, Aug 10, 2017 at 07:08:22PM +, Jork Loeser wrote:

> > > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB 
> > > flush
> 
> > > Hold on.. if we don't IPI for TLB invalidation. What serializes our
> > > software page table walkers like fast_gup() ?
> > 
> > Hypervisor may implement this functionality via an IPI.
> > 
> > K. Y
> 
> HvFlushVirtualAddressList() states:
> This call guarantees that by the time control returns back to the
> caller, the observable effects of all flushes on the specified virtual
> processors have occurred.
> 
> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding 
> sparse target VP lists.
> 
> Is this enough of a guarantee, or do you see other races?

That's nowhere near enough. We need the remote CPU to have completed any
guest IF section that was in progress at the time of the call.

So if a host IPI can interrupt a guest while the guest has IF cleared,
and we then process the host IPI -- clear the TLBs -- before resuming the
guest, which still has IF cleared, we've got a problem.

Because at that point, our software page-table walker, that relies on IF
being clear to guarantee the page-tables exist, because it holds off the
TLB invalidate and thereby the freeing of the pages, gets its pages
ripped out from under it.


RE: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-10 Thread Jork Loeser
> -Original Message-
> From: KY Srinivasan


> > -Original Message-
> > From: Peter Zijlstra [mailto:pet...@infradead.org]
> > Sent: Thursday, August 10, 2017 11:57 AM
> > To: Simon Xiao ; Haiyang Zhang
> > ; Jork Loeser ;
> > Stephen Hemminger ; torvalds@linux-
> > foundation.org; l...@kernel.org; h...@zytor.com; vkuzn...@redhat.com;
> > linux-kernel@vger.kernel.org; rost...@goodmis.org;
> > andy.shevche...@gmail.com; t...@linutronix.de; KY Srinivasan
> > ; mi...@kernel.org
> > Cc: linux-tip-comm...@vger.kernel.org
> > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote
> > TLB flush

> > Hold on.. if we don't IPI for TLB invalidation. What serializes our
> > software page table walkers like fast_gup() ?
> 
> Hypervisor may implement this functionality via an IPI.
> 
> K. Y

HvFlushVirtualAddressList() states:
This call guarantees that by the time control returns back to the caller, the 
observable effects of all flushes on the specified virtual processors have 
occurred.

HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding 
sparse target VP lists.

Is this enough of a guarantee, or do you see other races?

Regards,
Jork



RE: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-10 Thread KY Srinivasan


> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Thursday, August 10, 2017 11:57 AM
> To: Simon Xiao ; Haiyang Zhang
> ; Jork Loeser ;
> Stephen Hemminger ; torvalds@linux-
> foundation.org; l...@kernel.org; h...@zytor.com; vkuzn...@redhat.com;
> linux-kernel@vger.kernel.org; rost...@goodmis.org;
> andy.shevche...@gmail.com; t...@linutronix.de; KY Srinivasan
> ; mi...@kernel.org
> Cc: linux-tip-comm...@vger.kernel.org
> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB
> flush
> 
> On Thu, Aug 10, 2017 at 11:21:49AM -0700, tip-bot for Vitaly Kuznetsov
> wrote:
> > Commit-ID:  2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> > Gitweb: http://git.kernel.org/tip/2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> > Author: Vitaly Kuznetsov 
> > AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
> > Committer:  Ingo Molnar 
> > CommitDate: Thu, 10 Aug 2017 20:16:44 +0200
> >
> > x86/hyper-v: Use hypercall for remote TLB flush
> >
> > Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
> > this is supposed to work faster than IPIs.
> >
> > Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
> > we need to put the input somewhere in memory and we don't really want
> to
> > have memory allocation on each call so we pre-allocate per cpu memory
> areas
> > on boot.
> >
> > pv_ops patching is happening very early so we need to separate
> > hyperv_setup_mmu_ops() and hyper_alloc_mmu().
> >
> > It is possible and easy to implement local TLB flushing too and there is
> > even a hint for that. However, I don't see a room for optimization on the
> > host side as both hypercall and native tlb flush will result in vmexit. The
> > hint is also not set on modern Hyper-V versions.
> 
> Hold on.. if we don't IPI for TLB invalidation. What serializes our
> software page table walkers like fast_gup() ?

Hypervisor may implement this functionality via an IPI.

K. Y


Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-10 Thread Peter Zijlstra
On Thu, Aug 10, 2017 at 11:21:49AM -0700, tip-bot for Vitaly Kuznetsov wrote:
> Commit-ID:  2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> Gitweb: http://git.kernel.org/tip/2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> Author: Vitaly Kuznetsov 
> AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
> Committer:  Ingo Molnar 
> CommitDate: Thu, 10 Aug 2017 20:16:44 +0200
> 
> x86/hyper-v: Use hypercall for remote TLB flush
> 
> Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
> this is supposed to work faster than IPIs.
> 
> Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
> we need to put the input somewhere in memory and we don't really want to
> have memory allocation on each call so we pre-allocate per cpu memory areas
> on boot.
> 
> pv_ops patching is happening very early so we need to separate
> hyperv_setup_mmu_ops() and hyper_alloc_mmu().
> 
> It is possible and easy to implement local TLB flushing too and there is
> even a hint for that. However, I don't see a room for optimization on the
> host side as both hypercall and native tlb flush will result in vmexit. The
> hint is also not set on modern Hyper-V versions.

Hold on.. if we don't IPI for TLB invalidation. What serializes our
software page table walkers like fast_gup() ?


[tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-10 Thread tip-bot for Vitaly Kuznetsov
Commit-ID:  2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
Gitweb: http://git.kernel.org/tip/2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
Author: Vitaly Kuznetsov 
AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
Committer:  Ingo Molnar 
CommitDate: Thu, 10 Aug 2017 20:16:44 +0200

x86/hyper-v: Use hypercall for remote TLB flush

Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
this is supposed to work faster than IPIs.

Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
we need to put the input somewhere in memory and we don't really want to
have memory allocation on each call so we pre-allocate per cpu memory areas
on boot.

pv_ops patching is happening very early so we need to separate
hyperv_setup_mmu_ops() and hyper_alloc_mmu().

It is possible and easy to implement local TLB flushing too and there is
even a hint for that. However, I don't see a room for optimization on the
host side as both hypercall and native tlb flush will result in vmexit. The
hint is also not set on modern Hyper-V versions.

Signed-off-by: Vitaly Kuznetsov 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Stephen Hemminger 
Cc: Andy Lutomirski 
Cc: Haiyang Zhang 
Cc: Jork Loeser 
Cc: K. Y. Srinivasan 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Simon Xiao 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: de...@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-8-vkuzn...@redhat.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/hyperv/Makefile   |   2 +-
 arch/x86/hyperv/hv_init.c  |   2 +
 arch/x86/hyperv/mmu.c  | 138 +
 arch/x86/include/asm/mshyperv.h|   3 +
 arch/x86/include/uapi/asm/hyperv.h |   7 ++
 arch/x86/kernel/cpu/mshyperv.c |   1 +
 drivers/hv/Kconfig |   1 +
 7 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 171ae09..367a820 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1 +1 @@
-obj-y  := hv_init.o
+obj-y  := hv_init.o mmu.o
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e93b9a0..1a8eb55 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -140,6 +140,8 @@ void hyperv_init(void)
hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
 
+   hyper_alloc_mmu();
+
/*
 * Register Hyper-V specific clocksource.
 */
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
new file mode 100644
index 000..9419a20
--- /dev/null
+++ b/arch/x86/hyperv/mmu.c
@@ -0,0 +1,138 @@
+#define pr_fmt(fmt)  "Hyper-V: " fmt
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
+struct hv_flush_pcpu {
+   u64 address_space;
+   u64 flags;
+   u64 processor_mask;
+   u64 gva_list[];
+};
+
+/* Each gva in gva_list encodes up to 4096 pages to flush */
+#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
+
+static struct hv_flush_pcpu __percpu *pcpu_flush;
+
+/*
+ * Fills in gva_list starting from offset. Returns the number of items added.
+ */
+static inline int fill_gva_list(u64 gva_list[], int offset,
+   unsigned long start, unsigned long end)
+{
+   int gva_n = offset;
+   unsigned long cur = start, diff;
+
+   do {
+   diff = end > cur ? end - cur : 0;
+
+   gva_list[gva_n] = cur & PAGE_MASK;
+   /*
+* Lower 12 bits encode the number of additional
+* pages to flush (in addition to the 'cur' page).
+*/
+   if (diff >= HV_TLB_FLUSH_UNIT)
+   gva_list[gva_n] |= ~PAGE_MASK;
+   else if (diff)
+   gva_list[gva_n] |= (diff - 1) >> PAGE_SHIFT;
+
+   cur += HV_TLB_FLUSH_UNIT;
+   gva_n++;
+
+   } while (cur < end);
+
+   return gva_n - offset;
+}
+
+static void hyperv_flush_tlb_others(const struct cpumask *cpus,
+   const struct flush_tlb_info *info)
+{
+   int cpu, vcpu, gva_n, max_gvas;
+   struct hv_flush_pcpu *flush;
+   u64 status = U64_MAX;
+   unsigned long flags;
+
+   if (!pcpu_flush || !hv_hypercall_pg)
+   goto do_native;
+
+   if (cpumask_empty(cpus))
+   return;
+
+   local_irq_save(flags);
+
+   flush = this_cpu_ptr(pcpu_flush);
+
+   if (info->mm) {
+   flush->address_space = virt_to_phys(info->mm->pgd);
+   flush->flags = 0;
+   } else {
+   flush->address_space = 0;
+   flush->flags = HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES;
+   }
+
+   flush->processor_mask = 0;
+   if (cpumask_equal(cpus, cpu_present_mask)) {
+   flush->f

[tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

2017-08-10 Thread tip-bot for Vitaly Kuznetsov
Commit-ID:  88b46342eb037d35decda4d651cfee5216f4f822
Gitweb: http://git.kernel.org/tip/88b46342eb037d35decda4d651cfee5216f4f822
Author: Vitaly Kuznetsov 
AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
Committer:  Ingo Molnar 
CommitDate: Thu, 10 Aug 2017 16:50:23 +0200

x86/hyper-v: Use hypercall for remote TLB flush

Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
this is supposed to work faster than IPIs.

Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
we need to put the input somewhere in memory and we don't really want to
have memory allocation on each call so we pre-allocate per cpu memory areas
on boot.

pv_ops patching is happening very early so we need to separate
hyperv_setup_mmu_ops() and hyper_alloc_mmu().

It is possible and easy to implement local TLB flushing too and there is
even a hint for that. However, I don't see a room for optimization on the
host side as both hypercall and native tlb flush will result in vmexit. The
hint is also not set on modern Hyper-V versions.

Signed-off-by: Vitaly Kuznetsov 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Stephen Hemminger 
Cc: Andy Lutomirski 
Cc: Haiyang Zhang 
Cc: Jork Loeser 
Cc: K. Y. Srinivasan 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Simon Xiao 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: de...@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-8-vkuzn...@redhat.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/hyperv/Makefile   |   2 +-
 arch/x86/hyperv/hv_init.c  |   2 +
 arch/x86/hyperv/mmu.c  | 138 +
 arch/x86/include/asm/mshyperv.h|   3 +
 arch/x86/include/uapi/asm/hyperv.h |   7 ++
 arch/x86/kernel/cpu/mshyperv.c |   1 +
 6 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 171ae09..367a820 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1 +1 @@
-obj-y  := hv_init.o
+obj-y  := hv_init.o mmu.o
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e93b9a0..1a8eb55 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -140,6 +140,8 @@ void hyperv_init(void)
hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
 
+   hyper_alloc_mmu();
+
/*
 * Register Hyper-V specific clocksource.
 */
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
new file mode 100644
index 000..9419a20
--- /dev/null
+++ b/arch/x86/hyperv/mmu.c
@@ -0,0 +1,138 @@
+#define pr_fmt(fmt)  "Hyper-V: " fmt
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
+struct hv_flush_pcpu {
+   u64 address_space;
+   u64 flags;
+   u64 processor_mask;
+   u64 gva_list[];
+};
+
+/* Each gva in gva_list encodes up to 4096 pages to flush */
+#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
+
+static struct hv_flush_pcpu __percpu *pcpu_flush;
+
+/*
+ * Fills in gva_list starting from offset. Returns the number of items added.
+ */
+static inline int fill_gva_list(u64 gva_list[], int offset,
+   unsigned long start, unsigned long end)
+{
+   int gva_n = offset;
+   unsigned long cur = start, diff;
+
+   do {
+   diff = end > cur ? end - cur : 0;
+
+   gva_list[gva_n] = cur & PAGE_MASK;
+   /*
+* Lower 12 bits encode the number of additional
+* pages to flush (in addition to the 'cur' page).
+*/
+   if (diff >= HV_TLB_FLUSH_UNIT)
+   gva_list[gva_n] |= ~PAGE_MASK;
+   else if (diff)
+   gva_list[gva_n] |= (diff - 1) >> PAGE_SHIFT;
+
+   cur += HV_TLB_FLUSH_UNIT;
+   gva_n++;
+
+   } while (cur < end);
+
+   return gva_n - offset;
+}
+
+static void hyperv_flush_tlb_others(const struct cpumask *cpus,
+   const struct flush_tlb_info *info)
+{
+   int cpu, vcpu, gva_n, max_gvas;
+   struct hv_flush_pcpu *flush;
+   u64 status = U64_MAX;
+   unsigned long flags;
+
+   if (!pcpu_flush || !hv_hypercall_pg)
+   goto do_native;
+
+   if (cpumask_empty(cpus))
+   return;
+
+   local_irq_save(flags);
+
+   flush = this_cpu_ptr(pcpu_flush);
+
+   if (info->mm) {
+   flush->address_space = virt_to_phys(info->mm->pgd);
+   flush->flags = 0;
+   } else {
+   flush->address_space = 0;
+   flush->flags = HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES;
+   }
+
+   flush->processor_mask = 0;
+   if (cpumask_equal(cpus, cpu_present_mask)) {
+   flush->flags |= HV_FLUSH_ALL_PROCESSORS;
+   } e