[Xen-devel] [PATCH-for-4.10] adapt SUPPORT.md to match 4.11

2018-04-25 Thread Juergen Gross
Some tags have been changed in 4.11. Adapt the 4.10 ones to match in
order to produce an easier to read support HTML table.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 SUPPORT.md | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/SUPPORT.md b/SUPPORT.md
index cb862b538d..2b0d58feb3 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -74,15 +74,15 @@ No hardware requirements
 
 ### x86/HVM
 
-Status: Supported
+Status, domU: Supported
 
 Fully virtualised guest using hardware virtualisation extensions
 
 Requires hardware virtualisation support (Intel VMX / AMD SVM)
 
-### x86/PVH guest
+### x86/PVH
 
-Status: Supported
+Status, domU: Supported
 
 PVH is a next-generation paravirtualized mode
 designed to take advantage of hardware virtualization support when possible.
@@ -90,7 +90,7 @@ During development this was sometimes called HVMLite or PVHv2.
 
 Requires hardware virtualisation support (Intel VMX / AMD SVM)
 
-### ARM guest
+### ARM
 
 Status: Supported
 
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH-for-4.10] adapt SUPPORT.md to match 4.11

2018-04-25 Thread Juergen Gross
On 25/04/18 16:09, Ian Jackson wrote:
> Juergen Gross writes ("[PATCH-for-4.10] adapt SUPPORT.md to match 4.11"):
>> Some tags have been changed in 4.11. Adapt the 4.10 ones to match in
>> order to produce an easier to read support HTML table.
> 
> This does not seem to appply on top of my own 4.10 series.

Oh, my bad. I built it on top of staging-4.10.

I'll resend as soon as your series has been applied.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v8 1/9] x86/xpti: avoid copying L4 page table contents when possible

2018-04-22 Thread Juergen Gross
On 22/04/18 18:39, Tim Deegan wrote:
> At 19:11 +0200 on 21 Apr (1524337893), Juergen Gross wrote:
>> On 21/04/18 15:32, Tim Deegan wrote:
>>> At 09:44 +0200 on 19 Apr (1524131080), Juergen Gross wrote:
>>>> Another alternative would be to pass another flag to the callers to
>>>> signal the need for a flush. This would require quite some modifications
>>>> to shadow code I'd like to avoid, though. OTOH this way we could combine
>>>> flushing the tlb and the root page tables. Tim, any preferences?
>>>
>>> This sounds a promising direction but it should be doabl without major
>>> surgery to the shadow code.  The shadow code already leaves old sl4es
>>> visible (in TLBs) when it's safe to do so, so I think the right place
>>> to hook this is on the receiving side of the TLB flush IPI.  IOW as
>>> long as:
>>>  - you copy the L4 on context switch; and
>>>  - you copy it on the TLB flush IPI is received
>>> then you can rely on the existing TLB flush mechanisms to do what you need.
>>> And shadow doesn't have to behave differently from 'normal' PV MM.
>>
>> It is not so easy. The problem is that e.g. a page fault will flush the
>> TLB entry for the page in question, but it won't lead to the L4 to be
>> copied.
> 
> Oh yes, I see; thanks for the explanation.  It might be worth copying
> what the hardware does here, and checking/propagating the relevant l4e
> in the PV pagefault handler.

While in the long run being an interesting option I'm not sure I want
to go this route for 4.11. There might be nasty corner cases and I think
such a lazy approach is much more error prone than doing explicit
updates of the L4 table on the affected cpus in case of a modified
entry. After all this is the way we are doing it for the non-shadow
case, too.

I think we should either do the explicit call of flush_mask() in
shadow_set_l4e() or propagate the need for the flush up to the caller.

In case you want to discuss this problem in more detail I think we
should do that on IRC for speeding things up a little bit.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [Bug 198497] handle_mm_fault / xen_pmd_val / radix_tree_lookup_slot Null pointer

2018-04-23 Thread Juergen Gross
On 20/04/18 17:20, Jason Andryuk wrote:
> Adding xen-devel and the Linux Xen maintainers.
> 
> Summary: Some Xen users (and maybe others) are hitting a BUG in
> __radix_tree_lookup() under do_swap_page() - example backtrace is
> provided at the end.  Matthew Wilcox provided a band-aid patch that
> prints errors like the following instead of triggering the bug.
> 
> Skylake 32bit PAE Dom0:
> Bad swp_entry: 8000
> mm/swap_state.c:683: bad pte d3a39f1c(8004)
> 
> Ivy Bridge 32bit PAE Dom0:
> Bad swp_entry: 4000
> mm/swap_state.c:683: bad pte d3a05f1c(8002)
> 
> Other 32bit DomU:
> Bad swp_entry: 400
> mm/swap_state.c:683: bad pte e2187f30(8002)
> 
> Other 32bit:
> Bad swp_entry: 200
> mm/swap_state.c:683: bad pte ef3a3f38(8001)
> 
> The Linux bugzilla has more info
> https://bugzilla.kernel.org/show_bug.cgi?id=198497
> 
> This may not be exclusive to Xen Linux, but most of the reports are on
> Xen.  Matthew wonders if Xen might be stepping on the upper bits of a
> pte.
> 
> On Fri, Apr 20, 2018 at 9:39 AM, Matthew Wilcox  wrote:
>> On Fri, Apr 20, 2018 at 09:10:11AM -0400, Jason Andryuk wrote:
 Given that this is happening on Xen, I wonder if Xen is using some of the
 bits in the page table for its own purposes.
>>>
>>> The backtraces include do_swap_page().  While I have a swap partition
>>> configured, I don't think it's being used.  Are we somehow
>>> misidentifying the page as a swap page?  I'm not familiar with the
>>> code, but is there an easy way to query global swap usage?  That way
>>> we can see if the check for a swap page is bogus.
>>>
>>> My system works with the band-aid patch.  When that patch sets page =
>>> NULL, does that mean userspace is just going to get a zero-ed page?
>>> Userspace still works AFAICT, which makes me think it is a
>>> mis-identified page to start with.
>>
>> Here's how this code works.
> 
> Thanks for the description.
> 
>> When we swap out an anonymous page (a page which is not backed by a
>> file; could be from a MAP_PRIVATE mapping, could be brk()), we write it
>> to the swap cache.  In order to be able to find it again, we store a
>> cookie (called a swp_entry_t) in the process' page table (marked with
>> the 'present' bit clear, so the CPU will fault on it).  When we get a
>> fault, we look up the cookie in a radix tree and bring that page back
>> in from swap.
>>
>> If there's no page found in the radix tree, we put a freshly zeroed
>> page into the process's address space.  That's because we won't find
>> a page in the swap cache's radix tree for the first time we fault.
>> It's not an indication of a bug if there's no page to be found.
> 
> Is "no page found" the case for a lazy, un-allocated MAP_ANONYMOUS page?
> 
>> What we're seeing for this bug is page table entries of the format
>> 0x8000'0004''.  That would be a zeroed entry, except for the
>> fact that something's stepped on the upper bits.
> 
> Does a totally zero-ed entry correspond to an un-allocated MAP_ANONYMOUS page?
> 
>> What is worrying is that potentially Xen might be stepping on the upper
>> bits of either a present entry (leading to the process loading a page
>> that belongs to someone else) or an entry which has been swapped out,
>> leading to the process getting a zeroed page when it should be getting
>> its page back from swap.
> 
> There was at least one report of non-Xen 32bit being affected.  There
> was no backtrace, so it could be something else.  One report doesn't
> have any swap configured.
> 
>> Defending against this kind of corruption would take adding a parity
>> bit to the page tables.  That's not a project I have time for right now.
> 
> Understood.  Thanks for the response.
> 
> Regards,
> Jason
> 
> 
> [ 2234.939079] BUG: unable to handle kernel NULL pointer dereference at 
> 0008
> [ 2234.942154] IP: __radix_tree_lookup+0xe/0xa0
> [ 2234.945176] *pdpt = 08cd5027 *pde = 
> [ 2234.948382] Oops:  [#1] SMP
> [ 2234.951410] Modules linked in: hp_wmi sparse_keymap rfkill wmi_bmof
> pcspkr i915 wmi hp_accel lis3lv02d input_polldev drm_kms_helper
> syscopyarea sysfillrect sysimgblt fb_sys_fops drm hp_wireless
> i2c_algo_bit hid_multitouch sha256_generic xen_netfront v4v(O) psmouse
> ecb xts hid_generic xhci_pci xhci_hcd ohci_pci ohci_hcd uhci_hcd
> ehci_pci ehci_hcd usbhid hid tpm_tis tpm_tis_core tpm
> [ 2234.960816] CPU: 1 PID: 2338 Comm: xenvm Tainted: G   O4.14.18 
> #1
> [ 2234.963991] Hardware name: Hewlett-Packard HP EliteBook Folio
> 9470m/18DF, BIOS 68IBD Ver. F.40 02/01/2013
> [ 2234.967186] task: d4370980 task.stack: cf8e8000
> [ 2234.970351] EIP: __radix_tree_lookup+0xe/0xa0
> [ 2234.973520] EFLAGS: 00010286 CPU: 1
> [ 2234.976699] EAX: 0004 EBX: b590 ECX:  EDX: 
> [ 2234.979887] ESI:  EDI: 0004 EBP: cf8e9dd0 ESP: cf8e9dc0
> [ 2234.983081]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
> [ 

Re: [Xen-devel] [PATCH v2] Input: xen-kbdfront - allow better run-time configuration

2018-04-23 Thread Juergen Gross
On 23/04/18 10:02, Oleksandr Andrushchenko wrote:
> Juergen, Jason, Dmitry
> any comment on this?

Oleksandr, please give us some time. I can't speak for others, but I am
not sitting here idling and hoping that some work (e.g. patches to
review) might appear.

I have a lot of other stuff to do and will respond when I find some time
to look at your patches.

Pinging others on Monday when having sent out the patch only on Thursday
is rather unfriendly.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH-for-4.11] xpti: fix bug in double fault handling

2018-04-23 Thread Juergen Gross
When entering the hypervisor via the double fault handler resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next following interrupt, e.g. after re-enabling interrupts
in machine_restart().

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 tools/firmware/xen-dir/shim.config | 2 +-
 xen/arch/x86/x86_64/entry.S| 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/firmware/xen-dir/shim.config 
b/tools/firmware/xen-dir/shim.config
index 4d5630f87a..0b9e001436 100644
--- a/tools/firmware/xen-dir/shim.config
+++ b/tools/firmware/xen-dir/shim.config
@@ -1,6 +1,6 @@
 #
 # Automatically generated file; DO NOT EDIT.
-# Xen/x86 4.11-unstable Configuration
+# Xen/x86 4.11-rc Configuration
 #
 CONFIG_X86_64=y
 CONFIG_X86=y
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 45d9842d09..d12fbcb55d 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -772,6 +772,7 @@ ENTRY(double_fault)
 jns   .Ldblf_cr3_load
 neg   %rbx
 .Ldblf_cr3_load:
+movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
 mov   %rbx, %cr3
 .Ldblf_cr3_okay:
 
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH-for-4.11 v2] xpti: fix bug in double fault handling

2018-04-23 Thread Juergen Gross
On 23/04/18 14:49, Jan Beulich wrote:
>>>> On 23.04.18 at 14:38, <jgr...@suse.com> wrote:
>> When entering the hypervisor via the double fault handler resetting
>> xen_cr3 was missing. This led to switching to pv_cr3 when returning
>> from the next following interrupt. So repair this in order to allow
>> interrupt handling to work even after a double fault.
>>
>> Signed-off-by: Juergen Gross <jgr...@suse.com>
> 
> With s/interrupt/exception/ (or s/interrupt/interruption/) on the
> description (which is easy enough to do while committing)
> Reviewed-by: Jan Beulich <jbeul...@suse.com>
> 
> I take it that your Rab can be implied here.

Correct. :-)


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH-for-4.11 v2] xpti: fix bug in double fault handling

2018-04-23 Thread Juergen Gross
On 23/04/18 15:06, Andrew Cooper wrote:
> On 23/04/18 13:38, Juergen Gross wrote:
>> When entering the hypervisor via the double fault handler resetting
>> xen_cr3 was missing. This led to switching to pv_cr3 when returning
>> from the next following interrupt. So repair this in order to allow
>> interrupt handling to work even after a double fault.
>>
>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>> ---
>>  xen/arch/x86/x86_64/entry.S | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
>> index 45d9842d09..25427b0cec 100644
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -773,6 +773,7 @@ ENTRY(double_fault)
>>  neg   %rbx
>>  .Ldblf_cr3_load:
>>  mov   %rbx, %cr3
>> +movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>>  .Ldblf_cr3_okay:
>>  
>>  movq  %rsp,%rdi
> 
> What about the other write into xen_cr3 with a negated value?  Won't
> this still explode if we get an NMI or MCE at the wrong moment?

Hmm, you mean a NMI between the mov to %cr3 and zeroing xen_cr3?
Could be an issue, yes. Okay, V3 then...


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86: correct assertion in destroy_perdomain_mapping()

2018-04-23 Thread Juergen Gross
On 23/04/18 15:39, Andrew Cooper wrote:
> On 20/04/18 16:03, Jan Beulich wrote:
>> hvm_domain_initialise() may call this with nr being zero, which triggers
>> the "does not cross L3 boundary" check.
>>
>> Signed-off-by: Jan Beulich <jbeul...@suse.com>
> 
> Acked-by: Andrew Cooper <andrew.coop...@citrix.com>
> 

Release-acked-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH-for-4.11] xpti: fix bug in double fault handling

2018-04-23 Thread Juergen Gross
On 23/04/18 14:11, Jan Beulich wrote:
 On 23.04.18 at 13:37,  wrote:
>> When entering the hypervisor via the double fault handler resetting
>> xen_cr3 was missing. This led to switching to pv_cr3 when returning
>> from the next following interrupt, e.g. after re-enabling interrupts
>> in machine_restart().
> 
> Pointing at bad behavior to justify a change is not very helpful, I think.
> Andrew's argument of exception handling wanting to continue to work
> even after a #DF is a better one imo.

Okay.

> 
>> --- a/tools/firmware/xen-dir/shim.config
>> +++ b/tools/firmware/xen-dir/shim.config
>> @@ -1,6 +1,6 @@
>>  #
>>  # Automatically generated file; DO NOT EDIT.
>> -# Xen/x86 4.11-unstable Configuration
>> +# Xen/x86 4.11-rc Configuration
>>  #
>>  CONFIG_X86_64=y
>>  CONFIG_X86=y
> 
> Stray change?

Oh, that one again. Sorry. Maybe we should really take my related
patch to avoid issues like this one.

> 
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -772,6 +772,7 @@ ENTRY(double_fault)
>>  jns   .Ldblf_cr3_load
>>  neg   %rbx
>>  .Ldblf_cr3_load:
>> +movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>>  mov   %rbx, %cr3
>>  .Ldblf_cr3_okay:
> 
> Just like for the other code paths this write should be after the CR3 load,
> or else NMI or #MC occurring between the two would fail to update CR3.

Aah, right.

Will send V2 soon.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH-for-4.11 v2] xpti: fix bug in double fault handling

2018-04-23 Thread Juergen Gross
When entering the hypervisor via the double fault handler resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next following interrupt. So repair this in order to allow
interrupt handling to work even after a double fault.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 xen/arch/x86/x86_64/entry.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 45d9842d09..25427b0cec 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -773,6 +773,7 @@ ENTRY(double_fault)
 neg   %rbx
 .Ldblf_cr3_load:
 mov   %rbx, %cr3
+movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
 .Ldblf_cr3_okay:
 
 movq  %rsp,%rdi
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v3] xpti: fix bug in double fault handling

2018-04-23 Thread Juergen Gross
When entering the hypervisor via the double fault handler resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next following exception, so repair this in order to allow
exception handling to work even after a double fault.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 xen/arch/x86/x86_64/entry.S | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 45d9842d09..1cd7d93892 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -767,12 +767,14 @@ ENTRY(double_fault)
 /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
 mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
-test  %rbx, %rbx
+neg   %rbx
 jz.Ldblf_cr3_okay
 jns   .Ldblf_cr3_load
+mov   %rbx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
 neg   %rbx
 .Ldblf_cr3_load:
 mov   %rbx, %cr3
+movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
 .Ldblf_cr3_okay:
 
 movq  %rsp,%rdi
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 0/2] To introduce xenwatch multithreading (xen mtwatch)

2018-04-24 Thread Juergen Gross
On 24/04/18 07:52, Dongli Zhang wrote:
> Hi Juergen,
> 
> On 04/24/2018 01:22 PM, Juergen Gross wrote:
>> On 24/04/18 01:55, Dongli Zhang wrote:
>>> Hi Wei,
>>>
>>> On 04/23/2018 10:09 PM, Wei Liu wrote:
>>>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>>>> About per-domU xenwatch thread create/destroy, a new type of xenstore 
>>>>> node is
>>>>> introduced: '/local/domain/0/mtwatch/'.
>>>>>
>>>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the 
>>>>> insertion
>>>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>>>
>>>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'.  
>>>>> Kernel
>>>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is 
>>>>> inserted,
>>>>> while this kernel thread is destroyed when the corresponding xenstore 
>>>>> node is
>>>>> removed.
>>>>
>>>> Instead of inventing yet another node, can you not watch /local/domain
>>>> directly?
>>>
>>> Would you like to watch at /local/domain directly? Or is your question "is 
>>> there
>>> any other way to not watch at /local/domain, while no extra xenstore node 
>>> will
>>> be introduced"?
>>>
>>> Actually, the first prototype of this idea was to watch at /local/domain
>>> directly to get aware of the domU create/destroy, so that xen toolstack 
>>> will not
>>> get involved. Joao Martins (CCed) had a concern on the performance as 
>>> watching
>>> at /local/domain would lead to large amount of xenwatch events.
>>
>> That's what the special watches "@introduceDomain" and "@releaseDomain"
>> are meant for.
> 
> I used to consider to watch at "@introduceDomain". However, there is no domain
> information appended with "@introduceDomain" and it is still required for dom0
> kernel to proactively confirm who is created.

That isn't too hard, right? You just need to read /local/domain to get
the list of its children and look for new domains there.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH 2/7] doc: add architecture qualifier to boot parameter entries

2018-04-24 Thread Juergen Gross
Many of the architecture specific boot parameters are not qualified
as such. Correct that.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 docs/misc/xen-command-line.markdown | 168 ++--
 1 file changed, 84 insertions(+), 84 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index d175e6070b..a6b555d0b0 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -110,7 +110,7 @@ domain 0 command line
 Specify which ACPI MADT table to parse for APIC information, if more
 than one is present.
 
-### acpi\_pstate\_strict
+### acpi\_pstate\_strict (x86)
 > `= `
 
 > Default: `false`
@@ -119,12 +119,12 @@ Enforce checking that P-state transitions by the ACPI 
cpufreq driver
 actually result in the nominated frequency to be established. A warning
 message will be logged if that isn't the case.
 
-### acpi\_skip\_timer\_override
+### acpi\_skip\_timer\_override (x86)
 > `= `
 
 Instruct Xen to ignore timer-interrupt override.
 
-### acpi\_sleep
+### acpi\_sleep (x86)
 > `= s3_bios | s3_mode`
 
 `s3_bios` instructs Xen to invoke video BIOS initialization during S3
@@ -133,7 +133,7 @@ resume.
 `s3_mode` instructs Xen to set up the boot time (option `vga=`) video
 mode during S3 resume.
 
-### allow\_unsafe
+### allow\_unsafe (x86)
 > `= `
 
 > Default: `false`
@@ -152,14 +152,14 @@ to boot on systems with the following errata:
 
 Permit multiple copies of host p2m.
 
-### apic
+### apic (x86)
 > `= bigsmp | default`
 
 Override Xen's logic for choosing the APIC driver.  By default, if
 there are more than 8 CPUs, Xen will switch to `bigsmp` over
 `default`.
 
-### apicv
+### apicv (Intel)
 > `= `
 
 > Default: `true`
@@ -168,12 +168,12 @@ Permit Xen to use APIC Virtualisation Extensions.  This 
is an optimisation
 available as part of VT-x, and allows hardware to take care of the guests APIC
 handling, rather than requiring emulation in Xen.
 
-### apic\_verbosity
+### apic\_verbosity (x86)
 > `= verbose | debug`
 
 Increase the verbosity of the APIC code from the default value.
 
-### arat
+### arat (x86)
 > `= `
 
 > Default: `true`
@@ -182,7 +182,7 @@ Permit Xen to use "Always Running APIC Timer" support on 
compatible hardware
 in combination with cpuidle.  This option is only expected to be useful for
 developers wishing Xen to fall back to older timing methods on newer hardware.
 
-### asid
+### asid (x86)
 > `= `
 
 > Default: `true`
@@ -191,7 +191,7 @@ Permit Xen to use Address Space Identifiers.  This is an 
optimisation which
 tags the TLB entries with an ID per vcpu.  This allows for guest TLB flushes
 to be performed without the overhead of a complete TLB flush.
 
-### async-show-all
+### async-show-all (x86)
 > `= `
 
 > Default: `false`
@@ -199,7 +199,7 @@ to be performed without the overhead of a complete TLB 
flush.
 Forces all CPUs' full state to be logged upon certain fatal asynchronous
 exceptions (watchdog NMIs and unexpected MCEs).
 
-### ats
+### ats (x86)
 > `= `
 
 > Default: `false`
@@ -276,7 +276,7 @@ when the RSB gets overwritten.  The former control all RSB 
overwriting, while
 the latter two can be used to fine tune overwriting on from HVM context, and
 an entry from a native (PV or Xen) context.
 
-### clocksource
+### clocksource (x86)
 > `= pit | hpet | acpi | tsc`
 
 If set, override Xen's default choice for the platform timer.
@@ -287,7 +287,7 @@ the number of allowed CPUs.  When running on platforms that 
can guarantee a
 monotonic TSC across sockets you may want to adjust the "tsc" command line
 parameter to "stable:socket".
 
-### cmci-threshold
+### cmci-threshold (Intel)
 > `= `
 
 > Default: `2`
@@ -295,7 +295,7 @@ parameter to "stable:socket".
 Specify the event count threshold for raising Corrected Machine Check
 Interrupts.  Specifying zero disables CMCI handling.
 
-### cmos-rtc-probe
+### cmos-rtc-probe (x86)
 > `= `
 
 > Default: `false`
@@ -459,7 +459,7 @@ character, but for xen to keep the console.
 
 > Default: `power`
 
-### cpu\_type
+### cpu\_type (x86)
 > `= arch_perfmon`
 
 If set, force use of the performance counters for oprofile, rather than 
detecting
@@ -499,7 +499,7 @@ pre-canned cpuid mask to mask the current processor down to 
appear as
 the specified processor. It is important to ensure that all hosts in a
 pool appear the same to guests to allow successful live migration.
 
-### cpuid\_mask\_{{,ext\_}ecx,edx}
+### cpuid\_mask\_{{,ext\_}ecx,edx} (x86)
 > `= `
 
 > Default: `~0` (all bits set)
@@ -529,10 +529,10 @@ These three command line parameters are also used to 
specify cpuid
 masks to help with cpuid levelling across a pool of hosts.  See the
 description of the other respective options above.
 
-### cpuidle
+### cpuidle (x86)
 > `= `
 
-### cpuinfo
+### cpuinfo (x86)
 > `= `
 
 ### cras

[Xen-devel] [PATCH 7/7] doc: correct intel_psr_cat_cdp.pandoc syntax

2018-04-24 Thread Juergen Gross
"make -C docs all" fails due to incorrect markdown syntax in
intel_psr_cat_cdp.pandoc. Correct it.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 docs/features/intel_psr_cat_cdp.pandoc | 366 -
 1 file changed, 175 insertions(+), 191 deletions(-)

diff --git a/docs/features/intel_psr_cat_cdp.pandoc 
b/docs/features/intel_psr_cat_cdp.pandoc
index 04fb256dd9..c619e2cc99 100644
--- a/docs/features/intel_psr_cat_cdp.pandoc
+++ b/docs/features/intel_psr_cat_cdp.pandoc
@@ -104,19 +104,18 @@ PSR infrastructure in Xen.
   CAT/CDP defines a range of MSRs to assign different cache access patterns
   which are known as CBMs, each CBM is associated with a COS.
 
-  ```
   E.g. L2 CAT:
-  +++
- IA32_PQR_ASSOC   | MSR (per socket)   |Address |
-   ++---+---+ +++
-   ||COS|   | | IA32_L2_QOS_MASK_0 | 0xD10  |
-   ++---+---+ +++
-  +-> | ...|  ...   |
-  +++
-  | IA32_L2_QOS_MASK_n | 0xD10+n (n<64) |
-  +++
-  ```
-
+
+  +++
+ IA32_PQR_ASSOC   | MSR (per socket)   |Address |
+   ++---+---+ +++
+   ||COS|   | | IA32_L2_QOS_MASK_0 | 0xD10  |
+   ++---+---+ +++
+  +-> | ...|  ...   |
+  +++
+  | IA32_L2_QOS_MASK_n | 0xD10+n (n<64) |
+  +++
+
   L3 CAT/CDP uses a range of MSRs from 0xC90 ~ 0xC90+n (n<128).
 
   L2 CAT uses a range of MSRs from 0xD10 ~ 0xD10+n (n<64), following the L3
@@ -132,41 +131,39 @@ PSR infrastructure in Xen.
   note that all (and only) contiguous '1' combinations are allowed (e.g. H,
   0FF0H, 003CH, etc.).
 
-  ```
-   +++++++++
-   | M7 | M6 | M5 | M4 | M3 | M2 | M1 | M0 |
-   +++++++++
-  COS0 | A  | A  | A  | A  | A  | A  | A  | A  | Default Bitmask
-   +++++++++
-  COS1 | A  | A  | A  | A  | A  | A  | A  | A  |
-   +++++++++
-  COS2 | A  | A  | A  | A  | A  | A  | A  | A  |
-   +++++++++
-
-   +++++++++
-   | M7 | M6 | M5 | M4 | M3 | M2 | M1 | M0 |
-   +++++++++
-  COS0 | A  | A  | A  | A  | A  | A  | A  | A  | Overlapped Bitmask
-   +++++++++
-  COS1 ||||| A  | A  | A  | A  |
-   +++++++++
-  COS2 ||||||| A  | A  |
-   +++++++++
-
-   +++++++++
-   | M7 | M6 | M5 | M4 | M3 | M2 | M1 | M0 |
-   +++++++++
-  COS0 | A  | A  | A  | A  ||||| Isolated Bitmask
-   +++++++++
-  COS1 ||||| A  | A  |||
-   +++++++++
-  COS2 ||||||| A  | A  |
-   +++++++++
-  ```
+   +++++++++
+   | M7 | M6 | M5 | M4 | M3 | M2 | M1 | M0 |
+   +++++++++
+  COS0 | A  | A  | A  | A  | A  | A  | A  | A  | Default Bitmask
+   +++++++++
+  COS1 | A  | A  | A  | A  | A  | A  | A  | A  |
+   +++++++++
+  COS2 | A  | A  | A  | A  | A  | A  | A  | A  |
+   +++++++++
+
+   +++++++++
+   | M7 | M6 | M5 | M4 | M3 | M2 | M1 | M0 |
+   +++++++++
+  COS0 | A  | A  | A  | A  | A  | A  | A  | A  | Overlapped Bitmask
+   +++++++++
+  COS1 ||||| A  | A  | A  | A  |
+   +++++++++
+  COS2 ||||||| A  | A  |
+   +++++++++
+
+   +++++++++
+   | M7 

[Xen-devel] [PATCH 1/7] doc: sort entries of boot parameters correctly

2018-04-24 Thread Juergen Gross
Some of the boot parameters in docs/misc/xen-command-line.markdown are
not in the correct alphabetically order. Correct that.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 docs/misc/xen-command-line.markdown | 226 ++--
 1 file changed, 113 insertions(+), 113 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index b353352adf..d175e6070b 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -133,6 +133,18 @@ resume.
 `s3_mode` instructs Xen to set up the boot time (option `vga=`) video
 mode during S3 resume.
 
+### allow\_unsafe
+> `= `
+
+> Default: `false`
+
+Force boot on potentially unsafe systems. By default Xen will refuse
+to boot on systems with the following errata:
+
+* AMD Erratum 121. Processors with this erratum are subject to a guest
+  triggerable Denial of Service. Override only if you trust all of
+  your PV guests.
+
 ### altp2m (Intel)
 > `= `
 
@@ -147,18 +159,6 @@ Override Xen's logic for choosing the APIC driver.  By 
default, if
 there are more than 8 CPUs, Xen will switch to `bigsmp` over
 `default`.
 
-### allow\_unsafe
-> `= `
-
-> Default: `false`
-
-Force boot on potentially unsafe systems. By default Xen will refuse
-to boot on systems with the following errata:
-
-* AMD Erratum 121. Processors with this erratum are subject to a guest
-  triggerable Denial of Service. Override only if you trust all of
-  your PV guests.
-
 ### apicv
 > `= `
 
@@ -276,17 +276,6 @@ when the RSB gets overwritten.  The former control all RSB 
overwriting, while
 the latter two can be used to fine tune overwriting on from HVM context, and
 an entry from a native (PV or Xen) context.
 
-### xenheap\_megabytes (arm32)
-> `= `
-
-> Default: `0` (1/32 of RAM)
-
-Amount of RAM to set aside for the Xenheap. Must be an integer multiple of 32.
-
-By default will use 1/32 of the RAM up to a maximum of 1GB and with a
-minimum of 32M, subject to a suitably aligned and sized contiguous
-region of memory being available.
-
 ### clocksource
 > `= pit | hpet | acpi | tsc`
 
@@ -658,6 +647,24 @@ trace feature is only enabled in debugging builds of Xen.
 
 Specify the bit width of the DMA heap.
 
+### dom0
+> `= List of [ pvh | shadow ]`
+
+> Sub-options:
+
+> `pvh`
+
+> Default: `false`
+
+Flag that makes a dom0 boot in PVHv2 mode.
+
+> `shadow`
+
+> Default: `false`
+
+Flag that makes a dom0 use shadow paging. Only works when "pvh" is
+enabled.
+
 ### dom0\_ioports\_disable
 > `= List of -`
 
@@ -750,24 +757,6 @@ affinities to prefer but be not limited to the specified 
node(s).
 
 Pin dom0 vcpus to their respective pcpus
 
-### dom0
-> `= List of [ pvh | shadow ]`
-
-> Sub-options:
-
-> `pvh`
-
-> Default: `false`
-
-Flag that makes a dom0 boot in PVHv2 mode.
-
-> `shadow`
-
-> Default: `false`
-
-Flag that makes a dom0 use shadow paging. Only works when "pvh" is
-enabled.
-
 ### dtuart (ARM)
 > `= path [:options]`
 
@@ -824,6 +813,30 @@ effect the inverse meaning.
 >> Allows mapping of RuntimeServices which have no cachability attribute
 >> set as UC.
 
+### ept (Intel)
+> `= List of ( {no-}pml | {no-}ad )`
+
+Controls EPT related features.
+
+> Sub-options:
+
+> `pml`
+
+> Default: `true`
+
+>> PML is a new hardware feature in Intel's Broadwell Server and further
+>> platforms which reduces hypervisor overhead of log-dirty mechanism by
+>> automatically recording GPAs (guest physical addresses) when guest memory
+>> gets dirty, and therefore significantly reducing number of EPT violation
+>> caused by write protection of guest memory, which is a necessity to
+>> implement log-dirty mechanism before PML.
+
+> `ad`
+
+> Default: Hardware dependent
+
+>> Have hardware keep accessed/dirty (A/D) bits updated.
+
 ### extra\_guest\_irqs
 > `= [][,]`
 
@@ -889,30 +902,6 @@ requirement can be relaxed.  This option is particularly 
useful for nested
 virtualization, to allow the L1 hypervisor to use EPT even if the L0 hypervisor
 does not provide VM\_ENTRY\_LOAD\_GUEST\_PAT.
 
-### ept (Intel)
-> `= List of ( {no-}pml | {no-}ad )`
-
-Controls EPT related features.
-
-> Sub-options:
-
-> `pml`
-
-> Default: `true`
-
->> PML is a new hardware feature in Intel's Broadwell Server and further
->> platforms which reduces hypervisor overhead of log-dirty mechanism by
->> automatically recording GPAs (guest physical addresses) when guest memory
->> gets dirty, and therefore significantly reducing number of EPT violation
->> caused by write protection of guest memory, which is a necessity to
->> implement log-dirty mechanism before PML.
-
-> `ad`
-
-> Default: Hardware dependent
-
->> Have hardware keep accessed/dirty (A/D) bits updated.
-
 ### gdb
 > `= com1[H,L] | co

[Xen-devel] [PATCH 4/7] doc: escape underscores in xen-command-line.markdown

2018-04-24 Thread Juergen Gross
Some underscores are not escaped in xen-command-line.markdown.
Correct that.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 docs/misc/xen-command-line.markdown | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index ad08f3b645..713f9a6093 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -258,7 +258,7 @@ extreme care.**
 A (negative) boolean value can be specified to turn off all mitigations.
 (Use of a positive boolean value is invalid.)
 
-If Xen was compiled with INDIRECT_THUNK support, `thunk=` can be used to
+If Xen was compiled with INDIRECT\_THUNK support, `thunk=` can be used to
 select which of the thunks gets patched into the `__x86_indirect_thunk_%reg`
 locations.  The default thunk is `retpoline` (generally preferred for Intel
 hardware), with the alternatives being `jmp` (a `jmp *%reg` gadget, minimal
@@ -359,7 +359,7 @@ The accepted name keywords for name=value pairs are:
   Do note - these values are multiplied by 16.
 * `data-bits` - integer between 5 and 8
 * `dev` - accepted values are `pci` OR `amt`. If this option
-  is used to specify if the serial device is pci-based. The io_base
+  is used to specify if the serial device is pci-based. The io\_base
   cannot be specified when `dev=pci` or `dev=amt` is used.
 * `io-base` - accepts integer which specified IO base port for UART registers
 * `irq` - IRQ number to use
@@ -374,8 +374,8 @@ The accepted name keywords for name=value pairs are:
 The following are examples of correct specifications:
 
 com1=115200,8n1,0x3f8,4
-com1=115200,8n1,0x3f8,4,reg_width=4,reg_shift=2
-com1=baud=115200,parity=n,stop_bits=1,io_base=0x3f8,reg_width=4
+com1=115200,8n1,0x3f8,4,reg\_width=4,reg\_shift=2
+com1=baud=115200,parity=n,stop\_bits=1,io\_base=0x3f8,reg\_width=4
 
 ### conring\_size
 > `= `
@@ -1215,14 +1215,14 @@ established.
 > `= `
 
 ### irq\_vector\_map (x86)
-### ivrs_hpet[``] (AMD)
+### ivrs\_hpet[``] (AMD)
 > `=[:]:.`
 
 Force the use of `[:]:.` as device ID of HPET
 `` instead of the one specified by the IVHD sub-tables of the IVRS
 ACPI table.
 
-### ivrs_ioapic[``] (AMD)
+### ivrs\_ioapic[``] (AMD)
 > `=[:]:.`
 
 Force the use of `[:]:.` as device ID of IO-APIC
@@ -1257,13 +1257,13 @@ should be rate limited.
 ### low\_crashinfo
 > `= none | min | all`
 
-> Default: `none` if not specified at all, or to `min` if **low_crashinfo** is 
present without qualification.
+> Default: `none` if not specified at all, or to `min` if **low\_crashinfo** 
is present without qualification.
 
 This option is only useful for hosts with a 32bit dom0 kernel, wishing
 to use kexec functionality in the case of a crash.  It represents
 which data structures should be deliberately allocated in low memory,
 so the crash kernel may find find them.  Should be used in combination
-with **crashinfo_maxaddr**.
+with **crashinfo\_maxaddr**.
 
 ### low\_mem\_virq\_limit
 > `= `
@@ -1555,7 +1555,7 @@ Specify the host reboot method.
 
 `kbd` instructs Xen to reboot the host via the keyboard controller.
 
-`acpi` instructs Xen to reboot the host using RESET_REG in the ACPI FADT.
+`acpi` instructs Xen to reboot the host using RESET\_REG in the ACPI FADT.
 
 `pci` instructs Xen to reboot the host using PCI reset register (port CF9).
 
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH 3/7] doc: add credit2_cap_period_ms boot parameter description

2018-04-24 Thread Juergen Gross
credit2_cap_period_ms isn't mentioned in xen-command-line.markdown.
Add a description.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 docs/misc/xen-command-line.markdown | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index a6b555d0b0..ad08f3b645 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -568,6 +568,16 @@ which would otherwise require escaping of the < option
 ### credit2\_balance\_under
 > `= `
 
+### credit2\_cap\_period\_ms
+> `= `
+
+> Default: `10`
+
+Domains subject to a cap receive a replenishment of their runtime budget
+once every cap period interval. Default is 10 ms. The amount of budget
+they receive depends on their cap. For instance, a domain with a 50% cap
+will receive 50% of 10 ms, so 5 ms.
+
 ### credit2\_load\_precision\_shift
 > `= `
 
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH 0/7] fix several issues documentation

2018-04-24 Thread Juergen Gross
docs/misc/xen-command-line.markdown has several issues regarding
the sequence of parameters, missing architecture qualifiers, missing
documentation of one parameter, and markdown syntax. Fix all of those.

Some other documents have syntax issues, too, which show up when
trying to do "make all" in the docs directory. Fix those as well.

In case the maintainers are fine with my changes I believe the series
should be included in 4.11. So for the series:

Release-acked-by: Juergen Gross <jgr...@suse.com>


Juergen Gross (7):
  doc: sort entries of boot parameters correctly
  doc: add architecture qualifier to boot parameter entries
  doc: add credit2_cap_period_ms boot parameter description
  doc: escape underscores in xen-command-line.markdown
  doc: correct livepatch.markdown syntax
  doc: correct feature-levelling.pandoc syntax
  doc: correct intel_psr_cat_cdp.pandoc syntax

 docs/features/feature-levelling.pandoc |   2 +-
 docs/features/intel_psr_cat_cdp.pandoc | 366 ++--
 docs/misc/livepatch.markdown   | 589 +++--
 docs/misc/xen-command-line.markdown| 404 +++---
 4 files changed, 655 insertions(+), 706 deletions(-)

-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 0/1] drm/xen-zcopy: Add Xen zero-copy helper DRM driver

2018-04-24 Thread Juergen Gross
On 24/04/18 07:43, Oleksandr Andrushchenko wrote:
> On 04/24/2018 01:41 AM, Boris Ostrovsky wrote:
>> On 04/23/2018 08:10 AM, Oleksandr Andrushchenko wrote:
>>> On 04/23/2018 02:52 PM, Wei Liu wrote:
 On Fri, Apr 20, 2018 at 02:25:20PM +0300, Oleksandr Andrushchenko
 wrote:
>>>   the gntdev.
>>>
>>> I think this is generic enough that it could be implemented by a
>>> device not tied to Xen. AFAICT the hyper_dma guys also wanted
>>> something similar to this.
>> You can't just wrap random userspace memory into a dma-buf. We've
>> just had
>> this discussion with kvm/qemu folks, who proposed just that, and
>> after a
>> bit of discussion they'll now try to have a driver which just wraps a
>> memfd into a dma-buf.
> So, we have to decide either we introduce a new driver
> (say, under drivers/xen/xen-dma-buf) or extend the existing
> gntdev/balloon to support dma-buf use-cases.
>
> Can anybody from Xen community express their preference here?
>
 Oleksandr talked to me on IRC about this, he said a few IOCTLs need to
 be added to either existing drivers or a new driver.

 I went through this thread twice and skimmed through the relevant
 documents, but I couldn't see any obvious pros and cons for either
 approach. So I don't really have an opinion on this.

 But, assuming if implemented in existing drivers, those IOCTLs need to
 be added to different drivers, which means userspace program needs to
 write more code and get more handles, it would be slightly better to
 implement a new driver from that perspective.
>>> If gntdev/balloon extension is still considered:
>>>
>>> All the IOCTLs will be in gntdev driver (in current xen-zcopy
>>> terminology):
>>>   - DRM_ICOTL_XEN_ZCOPY_DUMB_FROM_REFS
>>>   - DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
>>>   - DRM_IOCTL_XEN_ZCOPY_DUMB_WAIT_FREE
>>>
>>> Balloon driver extension, which is needed for contiguous/DMA
>>> buffers, will be to provide new *kernel API*, no UAPI is needed.
>>>
>>
>> So I am obviously a bit late to this thread, but why do you need to add
>> new ioctls to gntdev and balloon? Doesn't this driver manage to do what
>> you want without any extensions?
> 1. I only (may) need to add IOCTLs to gntdev
> 2. balloon driver needs to be extended, so it can allocate
> contiguous (DMA) memory, not IOCTLs/UAPI here, all lives
> in the kernel.
> 3. The reason I need to extend gnttab with new IOCTLs is to
> provide new functionality to create a dma-buf from grant references
> and to produce grant references for a dma-buf. This is what I have as UAPI
> description for xen-zcopy driver:
> 
> 1. DRM_IOCTL_XEN_ZCOPY_DUMB_FROM_REFS
> This will create a DRM dumb buffer from grant references provided
> by the frontend. The intended usage is:
>   - Frontend
>     - creates a dumb/display buffer and allocates memory
>     - grants foreign access to the buffer pages
>     - passes granted references to the backend
>   - Backend
>     - issues DRM_XEN_ZCOPY_DUMB_FROM_REFS ioctl to map
>   granted references and create a dumb buffer
>     - requests handle to fd conversion via DRM_IOCTL_PRIME_HANDLE_TO_FD
>     - requests real HW driver/consumer to import the PRIME buffer with
>   DRM_IOCTL_PRIME_FD_TO_HANDLE
>     - uses handle returned by the real HW driver
>   - at the end:
>     o closes real HW driver's handle with DRM_IOCTL_GEM_CLOSE
>     o closes zero-copy driver's handle with DRM_IOCTL_GEM_CLOSE
>     o closes file descriptor of the exported buffer
> 
> 2. DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
> This will grant references to a dumb/display buffer's memory provided by
> the
> backend. The intended usage is:
>   - Frontend
>     - requests backend to allocate dumb/display buffer and grant references
>   to its pages
>   - Backend
>     - requests real HW driver to create a dumb with
> DRM_IOCTL_MODE_CREATE_DUMB
>     - requests handle to fd conversion via DRM_IOCTL_PRIME_HANDLE_TO_FD
>     - requests zero-copy driver to import the PRIME buffer with
>   DRM_IOCTL_PRIME_FD_TO_HANDLE
>     - issues DRM_XEN_ZCOPY_DUMB_TO_REFS ioctl to
>   grant references to the buffer's memory.
>     - passes grant references to the frontend
>  - at the end:
>     - closes zero-copy driver's handle with DRM_IOCTL_GEM_CLOSE
>     - closes real HW driver's handle with DRM_IOCTL_GEM_CLOSE
>     - closes file descriptor of the imported buffer
> 
> 3. DRM_XEN_ZCOPY_DUMB_WAIT_FREE
> This will block until the dumb buffer with the wait handle provided be
> freed:
> this is needed for synchronization between frontend and backend in case
> frontend provides grant references of the buffer via
> DRM_XEN_ZCOPY_DUMB_FROM_REFS IOCTL and which must be released before
> backend replies with XENDISPL_OP_DBUF_DESTROY response.
> wait_handle must be the same value returned while calling
> DRM_XEN_ZCOPY_DUMB_FROM_REFS IOCTL.
> 
> So, as you can see the above functionality is not 

[Xen-devel] [PATCH 5/7] doc: correct livepatch.markdown syntax

2018-04-24 Thread Juergen Gross
"make -C docs all" fails due to incorrect markdown syntax in
livepatch.markdown. Correct it.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 docs/misc/livepatch.markdown | 589 ---
 1 file changed, 272 insertions(+), 317 deletions(-)

diff --git a/docs/misc/livepatch.markdown b/docs/misc/livepatch.markdown
index 54a6b850cb..f3c1320e5a 100644
--- a/docs/misc/livepatch.markdown
+++ b/docs/misc/livepatch.markdown
@@ -89,33 +89,26 @@ As example we will assume the hypervisor does not have 
XSA-132 (see
 4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary patch
 the hypervisor with it. The original code looks as so:
 
-
-   48 89 e0  mov%rsp,%rax  
-   48 25 00 80 ff ff and$0x8000,%rax  
-
+   48 89 e0  mov%rsp,%rax  
+   48 25 00 80 ff ff and$0x8000,%rax  
 
 while the new patched hypervisor would be:
 
-
-   48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)  
-   48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)  
-   48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)  
-   48 89 e0  mov%rsp,%rax  
-   48 25 00 80 ff ff and$0x8000,%rax  
-
+   48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)  
+   48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)  
+   48 89 e0  mov%rsp,%rax  
+   48 25 00 80 ff ff and$0x8000,%rax  
 
-This is inside the arch_do_domctl. This new change adds 21 extra
+This is inside the arch\_do\_domctl. This new change adds 21 extra
 bytes of code which alters all the offsets inside the function. To alter
 these offsets and add the extra 21 bytes of code we might not have enough
 space in .text to squeeze this in.
 
 As such we could simplify this problem by only patching the site
-which calls arch_do_domctl:
+which calls arch\_do\_domctl:
 
-
-do_domctl:  
- e8 4b b1 05 00  callq  82d08015fbb9   
-
+do_domctl:  
+ e8 4b b1 05 00  callq  82d08015fbb9   
 
 with a new address for where the new `arch_do_domctl` would be (this
 area would be allocated dynamically).
@@ -123,15 +116,13 @@ area would be allocated dynamically).
 Astute readers will wonder what we need to do if we were to patch `do_domctl`
 - which is not called directly by hypervisor but on behalf of the guests via
 the `compat_hypercall_table` and `hypercall_table`.
-Patching the offset in `hypercall_table` for `do_domctl:
-(82d080103079 :)
+Patching the offset in `hypercall_table` for `do_domctl`:
+(82d080103079 do\_domctl:)
 
-
-
- 82d08024d490:   79 30  
- 82d08024d492:   10 80 d0 82 ff ff   
-
-
+
+ 82d08024d490:   79 30  
+ 82d08024d492:   10 80 d0 82 ff ff   
+
 
 with the new address where the new `do_domctl` is possible. The other
 place where it is used is in `hvm_hypercall64_table` which would need
@@ -164,19 +155,17 @@ CPU branching logic (I-cache, but it is just one 
unconditional jump).
 
 For this example we will assume that the hypervisor has not been compiled
 with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill structures
-for certain HYPERVISOR_xen_version sub-ops*) which mem-sets an structure
+for certain HYPERVISOR\_xen\_version sub-ops*) which mem-sets an structure
 in `xen_version` hypercall. This function is not called **anywhere** in
 the hypervisor (it is called by the guest) but referenced in the
 `compat_hypercall_table` and `hypercall_table` (and indirectly called
 from that). Patching the offset in `hypercall_table` for the old
-`do_xen_version` (82d080112f9e )
-
-
- 82d08024b270 :   
- ...  
- 82d08024b2f8:   9e 2f 11 80 d0 82 ff ff  
+`do_xen_version` (82d080112f9e do\_xen\_version)
 
-
+ 82d08024b270 :   
+ ...  
+ 82d08024b2f8:   9e 2f 11 80 d0 82 ff ff  
+
 
 with the new address where the new `do_xen_version` is possible. The other
 place where it is used is in `hvm_hypercall64_table` which would need
@@ -184,21 +173,17 @@ to be patched in a similar way. This would require an 
in-place splicing
 of the new virtual address of `do_xen_version`.
 
 An alternative solution would be to patch insert a trampoline in the
-old `do_xen_version' function to directly jump to the new `do_xen_version`.
+old `do_xen_version` function to directly jump to the new `do_xen_version`.
 
-
- 82d080112f9e do_xen_version:  
- 82d080112f9e:   48 c7 c0 da ff ff ffmov
$0xffda,%rax  
- 82d080112fa5:   83 ff 09cmp$0x9,%edi  
- 82d080112fa8:   0f 87 24 05 00 00   ja 82d0801134d2 ; 
do_xen_version+0x534  
-
+ 82d080112f9e do_xen_version:  
+ 82d080112f9e:   48 c7 c0 da ff ff ffmov
$0xffda,%rax  
+ 82d080112fa5:   83 ff 09cmp$0x9,%edi  
+ 82d080112fa8:   0f 87 24 0

[Xen-devel] [PATCH 6/7] doc: correct feature-levelling.pandoc syntax

2018-04-24 Thread Juergen Gross
"make -C docs all" fails due to incorrect markdown syntax in
feature-levelling.pandoc. Correct it.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 docs/features/feature-levelling.pandoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/features/feature-levelling.pandoc 
b/docs/features/feature-levelling.pandoc
index ef77eb837d..4b2b6df151 100644
--- a/docs/features/feature-levelling.pandoc
+++ b/docs/features/feature-levelling.pandoc
@@ -83,7 +83,7 @@ not trap, leaving Xen no direct ability to control the 
information returned.
 
 Xen-aware PV software can make use of the 'Forced Emulation Prefix'
 
-> `ud2a; .ascii 'xen'; cpuid`
+ud2a; .ascii 'xen'; cpuid
 
 which Xen recognises as a deliberate attempt to get the fully-controlled
 `CPUID` information rather than the hardware-reported information.  This only
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC 0/2] To introduce xenwatch multithreading (xen mtwatch)

2018-04-24 Thread Juergen Gross
On 24/04/18 08:10, Dongli Zhang wrote:
> 
> 
> On 04/24/2018 02:03 PM, Juergen Gross wrote:
>> On 24/04/18 07:52, Dongli Zhang wrote:
>>> Hi Juergen,
>>>
>>> On 04/24/2018 01:22 PM, Juergen Gross wrote:
>>>> On 24/04/18 01:55, Dongli Zhang wrote:
>>>>> Hi Wei,
>>>>>
>>>>> On 04/23/2018 10:09 PM, Wei Liu wrote:
>>>>>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>>>>>> About per-domU xenwatch thread create/destroy, a new type of xenstore 
>>>>>>> node is
>>>>>>> introduced: '/local/domain/0/mtwatch/'.
>>>>>>>
>>>>>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>>>>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the 
>>>>>>> insertion
>>>>>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>>>>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>>>>>
>>>>>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'.  
>>>>>>> Kernel
>>>>>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is 
>>>>>>> inserted,
>>>>>>> while this kernel thread is destroyed when the corresponding xenstore 
>>>>>>> node is
>>>>>>> removed.
>>>>>>
>>>>>> Instead of inventing yet another node, can you not watch /local/domain
>>>>>> directly?
>>>>>
>>>>> Would you like to watch at /local/domain directly? Or is your question 
>>>>> "is there
>>>>> any other way to not watch at /local/domain, while no extra xenstore node 
>>>>> will
>>>>> be introduced"?
>>>>>
>>>>> Actually, the first prototype of this idea was to watch at /local/domain
>>>>> directly to get aware of the domU create/destroy, so that xen toolstack 
>>>>> will not
>>>>> get involved. Joao Martins (CCed) had a concern on the performance as 
>>>>> watching
>>>>> at /local/domain would lead to large amount of xenwatch events.
>>>>
>>>> That's what the special watches "@introduceDomain" and "@releaseDomain"
>>>> are meant for.
>>>
>>> I used to consider to watch at "@introduceDomain". However, there is no 
>>> domain
>>> information appended with "@introduceDomain" and it is still required for 
>>> dom0
>>> kernel to proactively confirm who is created.
>>
>> That isn't too hard, right? You just need to read /local/domain to get
>> the list of its children and look for new domains there.
> 
> You are right. I will try to limit the modification within linux kernel, and 
> try
> to not dirty xen toolstack.
> 
> Thank you very much for the suggestion.

When going that route you should extend xenbus_directory() in the kernel
to use XS_DIRECTORY_PART in case XS_DIRECTORY returns E2BIG (see
xs_directory() in Xen's tools/xenstore/xs.c ). This will enable the
kernel to support more than about 1000 domains.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 0/1] drm/xen-zcopy: Add Xen zero-copy helper DRM driver

2018-04-24 Thread Juergen Gross
On 24/04/18 10:07, Oleksandr Andrushchenko wrote:
> On 04/24/2018 10:51 AM, Juergen Gross wrote:
>> On 24/04/18 07:43, Oleksandr Andrushchenko wrote:
>>> On 04/24/2018 01:41 AM, Boris Ostrovsky wrote:
>>>> On 04/23/2018 08:10 AM, Oleksandr Andrushchenko wrote:
>>>>> On 04/23/2018 02:52 PM, Wei Liu wrote:
>>>>>> On Fri, Apr 20, 2018 at 02:25:20PM +0300, Oleksandr Andrushchenko
>>>>>> wrote:
>>>>>>>>>    the gntdev.
>>>>>>>>>
>>>>>>>>> I think this is generic enough that it could be implemented by a
>>>>>>>>> device not tied to Xen. AFAICT the hyper_dma guys also wanted
>>>>>>>>> something similar to this.
>>>>>>>> You can't just wrap random userspace memory into a dma-buf. We've
>>>>>>>> just had
>>>>>>>> this discussion with kvm/qemu folks, who proposed just that, and
>>>>>>>> after a
>>>>>>>> bit of discussion they'll now try to have a driver which just
>>>>>>>> wraps a
>>>>>>>> memfd into a dma-buf.
>>>>>>> So, we have to decide either we introduce a new driver
>>>>>>> (say, under drivers/xen/xen-dma-buf) or extend the existing
>>>>>>> gntdev/balloon to support dma-buf use-cases.
>>>>>>>
>>>>>>> Can anybody from Xen community express their preference here?
>>>>>>>
>>>>>> Oleksandr talked to me on IRC about this, he said a few IOCTLs
>>>>>> need to
>>>>>> be added to either existing drivers or a new driver.
>>>>>>
>>>>>> I went through this thread twice and skimmed through the relevant
>>>>>> documents, but I couldn't see any obvious pros and cons for either
>>>>>> approach. So I don't really have an opinion on this.
>>>>>>
>>>>>> But, assuming if implemented in existing drivers, those IOCTLs
>>>>>> need to
>>>>>> be added to different drivers, which means userspace program needs to
>>>>>> write more code and get more handles, it would be slightly better to
>>>>>> implement a new driver from that perspective.
>>>>> If gntdev/balloon extension is still considered:
>>>>>
>>>>> All the IOCTLs will be in gntdev driver (in current xen-zcopy
>>>>> terminology):
>>>>>    - DRM_ICOTL_XEN_ZCOPY_DUMB_FROM_REFS
>>>>>    - DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
>>>>>    - DRM_IOCTL_XEN_ZCOPY_DUMB_WAIT_FREE
>>>>>
>>>>> Balloon driver extension, which is needed for contiguous/DMA
>>>>> buffers, will be to provide new *kernel API*, no UAPI is needed.
>>>>>
>>>> So I am obviously a bit late to this thread, but why do you need to add
>>>> new ioctls to gntdev and balloon? Doesn't this driver manage to do what
>>>> you want without any extensions?
>>> 1. I only (may) need to add IOCTLs to gntdev
>>> 2. balloon driver needs to be extended, so it can allocate
>>> contiguous (DMA) memory, not IOCTLs/UAPI here, all lives
>>> in the kernel.
>>> 3. The reason I need to extend gnttab with new IOCTLs is to
>>> provide new functionality to create a dma-buf from grant references
>>> and to produce grant references for a dma-buf. This is what I have as
>>> UAPI
>>> description for xen-zcopy driver:
>>>
>>> 1. DRM_IOCTL_XEN_ZCOPY_DUMB_FROM_REFS
>>> This will create a DRM dumb buffer from grant references provided
>>> by the frontend. The intended usage is:
>>>    - Frontend
>>>  - creates a dumb/display buffer and allocates memory
>>>  - grants foreign access to the buffer pages
>>>  - passes granted references to the backend
>>>    - Backend
>>>  - issues DRM_XEN_ZCOPY_DUMB_FROM_REFS ioctl to map
>>>    granted references and create a dumb buffer
>>>  - requests handle to fd conversion via DRM_IOCTL_PRIME_HANDLE_TO_FD
>>>  - requests real HW driver/consumer to import the PRIME buffer with
>>>    DRM_IOCTL_PRIME_FD_TO_HANDLE
>>>  - uses handle returned by the real HW driver
>>>    - at the end:
>>>  o closes real HW driver's handle with DRM_IOCTL_GEM_CLOSE
>>>  o closes zero-copy driver's handle with DRM_IOCTL_GEM_CLOSE
>

Re: [Xen-devel] [PATCH for-4.11] x86/spec_ctrl: Fix typo in ARCH_CAPS decode

2018-04-24 Thread Juergen Gross
On 24/04/18 11:44, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com>

Release-acked-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 0/1] drm/xen-zcopy: Add Xen zero-copy helper DRM driver

2018-04-24 Thread Juergen Gross
On 24/04/18 12:14, Oleksandr Andrushchenko wrote:
> On 04/24/2018 01:01 PM, Wei Liu wrote:
>> On Tue, Apr 24, 2018 at 11:08:41AM +0200, Juergen Gross wrote:
>>> On 24/04/18 11:03, Oleksandr Andrushchenko wrote:
>>>> On 04/24/2018 11:40 AM, Juergen Gross wrote:
>>>>> On 24/04/18 10:07, Oleksandr Andrushchenko wrote:
>>>>>> On 04/24/2018 10:51 AM, Juergen Gross wrote:
>>>>>>> On 24/04/18 07:43, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04/24/2018 01:41 AM, Boris Ostrovsky wrote:
>>>>>>>>> On 04/23/2018 08:10 AM, Oleksandr Andrushchenko wrote:
>>>>>>>>>> On 04/23/2018 02:52 PM, Wei Liu wrote:
>>>>>>>>>>> On Fri, Apr 20, 2018 at 02:25:20PM +0300, Oleksandr
>>>>>>>>>>> Andrushchenko
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>  the gntdev.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think this is generic enough that it could be
>>>>>>>>>>>>>> implemented by a
>>>>>>>>>>>>>> device not tied to Xen. AFAICT the hyper_dma guys also wanted
>>>>>>>>>>>>>> something similar to this.
>>>>>>>>>>>>> You can't just wrap random userspace memory into a dma-buf.
>>>>>>>>>>>>> We've
>>>>>>>>>>>>> just had
>>>>>>>>>>>>> this discussion with kvm/qemu folks, who proposed just
>>>>>>>>>>>>> that, and
>>>>>>>>>>>>> after a
>>>>>>>>>>>>> bit of discussion they'll now try to have a driver which just
>>>>>>>>>>>>> wraps a
>>>>>>>>>>>>> memfd into a dma-buf.
>>>>>>>>>>>> So, we have to decide either we introduce a new driver
>>>>>>>>>>>> (say, under drivers/xen/xen-dma-buf) or extend the existing
>>>>>>>>>>>> gntdev/balloon to support dma-buf use-cases.
>>>>>>>>>>>>
>>>>>>>>>>>> Can anybody from Xen community express their preference here?
>>>>>>>>>>>>
>>>>>>>>>>> Oleksandr talked to me on IRC about this, he said a few IOCTLs
>>>>>>>>>>> need to
>>>>>>>>>>> be added to either existing drivers or a new driver.
>>>>>>>>>>>
>>>>>>>>>>> I went through this thread twice and skimmed through the
>>>>>>>>>>> relevant
>>>>>>>>>>> documents, but I couldn't see any obvious pros and cons for
>>>>>>>>>>> either
>>>>>>>>>>> approach. So I don't really have an opinion on this.
>>>>>>>>>>>
>>>>>>>>>>> But, assuming if implemented in existing drivers, those IOCTLs
>>>>>>>>>>> need to
>>>>>>>>>>> be added to different drivers, which means userspace program
>>>>>>>>>>> needs to
>>>>>>>>>>> write more code and get more handles, it would be slightly
>>>>>>>>>>> better to
>>>>>>>>>>> implement a new driver from that perspective.
>>>>>>>>>> If gntdev/balloon extension is still considered:
>>>>>>>>>>
>>>>>>>>>> All the IOCTLs will be in gntdev driver (in current xen-zcopy
>>>>>>>>>> terminology):
>>>>>>>>>>  - DRM_ICOTL_XEN_ZCOPY_DUMB_FROM_REFS
>>>>>>>>>>  - DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
>>>>>>>>>>  - DRM_IOCTL_XEN_ZCOPY_DUMB_WAIT_FREE
>>>>>>>>>>
>>>>>>>>>> Balloon driver extension, which is needed for contiguous/DMA
>>>>>>>>>> buffers, will be to provide new *kernel API*, no UAPI is needed.
>>>>>>>>>>
>>>>>>>>> So I am obviously a bit late to this thread, but why do you need
>>>&g

Re: [Xen-devel] [PATCH 0/1] drm/xen-zcopy: Add Xen zero-copy helper DRM driver

2018-04-24 Thread Juergen Gross
On 24/04/18 11:03, Oleksandr Andrushchenko wrote:
> On 04/24/2018 11:40 AM, Juergen Gross wrote:
>> On 24/04/18 10:07, Oleksandr Andrushchenko wrote:
>>> On 04/24/2018 10:51 AM, Juergen Gross wrote:
>>>> On 24/04/18 07:43, Oleksandr Andrushchenko wrote:
>>>>> On 04/24/2018 01:41 AM, Boris Ostrovsky wrote:
>>>>>> On 04/23/2018 08:10 AM, Oleksandr Andrushchenko wrote:
>>>>>>> On 04/23/2018 02:52 PM, Wei Liu wrote:
>>>>>>>> On Fri, Apr 20, 2018 at 02:25:20PM +0300, Oleksandr Andrushchenko
>>>>>>>> wrote:
>>>>>>>>>>>     the gntdev.
>>>>>>>>>>>
>>>>>>>>>>> I think this is generic enough that it could be implemented by a
>>>>>>>>>>> device not tied to Xen. AFAICT the hyper_dma guys also wanted
>>>>>>>>>>> something similar to this.
>>>>>>>>>> You can't just wrap random userspace memory into a dma-buf. We've
>>>>>>>>>> just had
>>>>>>>>>> this discussion with kvm/qemu folks, who proposed just that, and
>>>>>>>>>> after a
>>>>>>>>>> bit of discussion they'll now try to have a driver which just
>>>>>>>>>> wraps a
>>>>>>>>>> memfd into a dma-buf.
>>>>>>>>> So, we have to decide either we introduce a new driver
>>>>>>>>> (say, under drivers/xen/xen-dma-buf) or extend the existing
>>>>>>>>> gntdev/balloon to support dma-buf use-cases.
>>>>>>>>>
>>>>>>>>> Can anybody from Xen community express their preference here?
>>>>>>>>>
>>>>>>>> Oleksandr talked to me on IRC about this, he said a few IOCTLs
>>>>>>>> need to
>>>>>>>> be added to either existing drivers or a new driver.
>>>>>>>>
>>>>>>>> I went through this thread twice and skimmed through the relevant
>>>>>>>> documents, but I couldn't see any obvious pros and cons for either
>>>>>>>> approach. So I don't really have an opinion on this.
>>>>>>>>
>>>>>>>> But, assuming if implemented in existing drivers, those IOCTLs
>>>>>>>> need to
>>>>>>>> be added to different drivers, which means userspace program
>>>>>>>> needs to
>>>>>>>> write more code and get more handles, it would be slightly
>>>>>>>> better to
>>>>>>>> implement a new driver from that perspective.
>>>>>>> If gntdev/balloon extension is still considered:
>>>>>>>
>>>>>>> All the IOCTLs will be in gntdev driver (in current xen-zcopy
>>>>>>> terminology):
>>>>>>> - DRM_ICOTL_XEN_ZCOPY_DUMB_FROM_REFS
>>>>>>> - DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
>>>>>>> - DRM_IOCTL_XEN_ZCOPY_DUMB_WAIT_FREE
>>>>>>>
>>>>>>> Balloon driver extension, which is needed for contiguous/DMA
>>>>>>> buffers, will be to provide new *kernel API*, no UAPI is needed.
>>>>>>>
>>>>>> So I am obviously a bit late to this thread, but why do you need
>>>>>> to add
>>>>>> new ioctls to gntdev and balloon? Doesn't this driver manage to do
>>>>>> what
>>>>>> you want without any extensions?
>>>>> 1. I only (may) need to add IOCTLs to gntdev
>>>>> 2. balloon driver needs to be extended, so it can allocate
>>>>> contiguous (DMA) memory, not IOCTLs/UAPI here, all lives
>>>>> in the kernel.
>>>>> 3. The reason I need to extend gnttab with new IOCTLs is to
>>>>> provide new functionality to create a dma-buf from grant references
>>>>> and to produce grant references for a dma-buf. This is what I have as
>>>>> UAPI
>>>>> description for xen-zcopy driver:
>>>>>
>>>>> 1. DRM_IOCTL_XEN_ZCOPY_DUMB_FROM_REFS
>>>>> This will create a DRM dumb buffer from grant references provided
>>>>> by the frontend. The intended usage is:
>>>>>     - Frontend
>>>>>   - creates a dumb/display 

Re: [Xen-devel] [RFC 0/2] To introduce xenwatch multithreading (xen mtwatch)

2018-04-23 Thread Juergen Gross
On 24/04/18 01:55, Dongli Zhang wrote:
> Hi Wei,
> 
> On 04/23/2018 10:09 PM, Wei Liu wrote:
>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node 
>>> is
>>> introduced: '/local/domain/0/mtwatch/'.
>>>
>>> Suppose the new domid id 7. During the domU (domid=7) creation, the xen
>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the 
>>> insertion
>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>
>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'.  
>>> Kernel
>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is 
>>> inserted,
>>> while this kernel thread is destroyed when the corresponding xenstore node 
>>> is
>>> removed.
>>
>> Instead of inventing yet another node, can you not watch /local/domain
>> directly?
> 
> Would you like to watch at /local/domain directly? Or is your question "is 
> there
> any other way to not watch at /local/domain, while no extra xenstore node will
> be introduced"?
> 
> Actually, the first prototype of this idea was to watch at /local/domain
> directly to get aware of the domU create/destroy, so that xen toolstack will 
> not
> get involved. Joao Martins (CCed) had a concern on the performance as watching
> at /local/domain would lead to large amount of xenwatch events.

That's what the special watches "@introduceDomain" and "@releaseDomain"
are meant for.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v8 1/9] x86/xpti: avoid copying L4 page table contents when possible

2018-04-24 Thread Juergen Gross
On 24/04/18 12:31, Tim Deegan wrote:
> At 07:45 +0200 on 23 Apr (1524469545), Juergen Gross wrote:
>> On 22/04/18 18:39, Tim Deegan wrote:
>>> At 19:11 +0200 on 21 Apr (1524337893), Juergen Gross wrote:
>>>> On 21/04/18 15:32, Tim Deegan wrote:
>>>>> At 09:44 +0200 on 19 Apr (1524131080), Juergen Gross wrote:
>>>>>> Another alternative would be to pass another flag to the callers to
>>>>>> signal the need for a flush. This would require quite some modifications
>>>>>> to shadow code I'd like to avoid, though. OTOH this way we could combine
>>>>>> flushing the tlb and the root page tables. Tim, any preferences?
>>>>>
>>>>> This sounds a promising direction but it should be doabl without major
>>>>> surgery to the shadow code.  The shadow code already leaves old sl4es
>>>>> visible (in TLBs) when it's safe to do so, so I think the right place
>>>>> to hook this is on the receiving side of the TLB flush IPI.  IOW as
>>>>> long as:
>>>>>  - you copy the L4 on context switch; and
>>>>>  - you copy it on the TLB flush IPI is received
>>>>> then you can rely on the existing TLB flush mechanisms to do what you 
>>>>> need.
>>>>> And shadow doesn't have to behave differently from 'normal' PV MM.
>>>>
>>>> It is not so easy. The problem is that e.g. a page fault will flush the
>>>> TLB entry for the page in question, but it won't lead to the L4 to be
>>>> copied.
>>>
>>> Oh yes, I see; thanks for the explanation.  It might be worth copying
>>> what the hardware does here, and checking/propagating the relevant l4e
>>> in the PV pagefault handler.
>>
>> While in the long run being an interesting option I'm not sure I want
>> to go this route for 4.11. There might be nasty corner cases and I think
>> such a lazy approach is much more error prone than doing explicit
>> updates of the L4 table on the affected cpus in case of a modified
>> entry. I think we should either do the explicit call of flush_mask() in
>> shadow_set_l4e() or propagate the need for the flush up to the caller.
> 
> FWIW, I disagree -- I think that having the fault handler DTRT and
> relying on the existing, tested, TLB-flush logic is more likely to
> work than introducing a new mechanism that _also_ has to catch every
> possible l4e update.  It should touch less code and be less likely to
> break with later changes.  And I think it would be better to do it
> 'properly' now than to hope there's time to revisit it later.  That
> said, if Jan agrees that this way is OK, I'll quit grumbling and
> review the shadow parts. :)

Okay, thanks.

> I think that setting the bits in shadow_set_l4e() is better than
> having this leak out into all the callers.  I'm happy to see that the
> hunk in l4e_propagate_from_guest() has gone away too.
> 
> Please move the shadow_set_l4e() hunk up so it's just after the write,
> and before the general TLB flush logic.

Okay.

> Please move the logic into your code: the new function should take a
> domain pointer and do all the filtering itself rather than have shadow
> code be aware of what xpti is or why the domain's dirty-cpumask is
> relevant.

Okay.

> It doesn't look like there's any check limiting this to PV guests, and
> I think there should be, right?

In my newest version it already is testing that.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] Fwd: [tip:x86/urgent] x86, sched: Allow topologies where NUMA nodes share an LLC

2018-04-17 Thread Juergen Gross
Is this something we should be aware of in Xen, too?


Juergen


 Forwarded Message 
Subject: [tip:x86/urgent] x86,sched: Allow topologies where NUMA nodes
share an LLC
Date: Tue, 17 Apr 2018 06:46:12 -0700
From: tip-bot for Alison Schofield 
Reply-To: h...@zytor.com, b...@suse.de, tony.l...@intel.com,
tim.c.c...@linux.intel.com, b...@alien8.de, h...@linux.intel.com,
dave.han...@linux.intel.com, imamm...@redhat.com, pra...@redhat.com,
linux-ker...@vger.kernel.org, pet...@infradead.org,
alison.schofi...@intel.com, t...@linutronix.de, mi...@kernel.org,
rient...@google.com
To: linux-tip-comm...@vger.kernel.org
CC: pet...@infradead.org, linux-ker...@vger.kernel.org,
pra...@redhat.com, imamm...@redhat.com, rient...@google.com,
mi...@kernel.org, alison.schofi...@intel.com, t...@linutronix.de,
tim.c.c...@linux.intel.com, tony.l...@intel.com, b...@suse.de,
h...@zytor.com, dave.han...@linux.intel.com, h...@linux.intel.com,
b...@alien8.de

Commit-ID:  1340ccfa9a9afefdbab90d7935d4ed19817e37c2
Gitweb:
https://git.kernel.org/tip/1340ccfa9a9afefdbab90d7935d4ed19817e37c2
Author: Alison Schofield 
AuthorDate: Fri, 6 Apr 2018 17:21:30 -0700
Committer:  Thomas Gleixner 
CommitDate: Tue, 17 Apr 2018 15:39:55 +0200

x86,sched: Allow topologies where NUMA nodes share an LLC

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
into two "slices", each containing half the cores, half the LLC, and one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This means
that the portion of the LLC *available* to a CPU depends on the data being
accessed:

Remote socket: entire package LLC is shared
Local socket->local slice: data goes into local slice LLC
Local socket->remote slice: data goes into remote-slice LLC. Slightly
higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One of
the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates the LLC
as being shared by the entire package. This is not 100% precise because the
entire cache is not usable by all accesses. But, it *is* the way the
hardware enumerates itself, and this is not likely to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As noted
above, this is not true for local socket accesses. This patch does not
correct the sysfs info. It is the same, pre and post patch.

The current code emits the following warning:

 sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 !=
0]. Ignoring dependency.

The warning is coming from the topology_sane() check in smpboot.c because
the topology is not matching the expectations of the model for obvious
reasons.

To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
the "coregroup" sched_domain_topology_level and use NUMA information from
the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package level,
which are also NUMA boundaries.

Signed-off-by: Alison Schofield 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Borislav Petkov 
Cc: Prarit Bhargava 
Cc: Tony Luck 
Cc: Peter Zijlstra (Intel) 
Cc: brice.gog...@gmail.com
Cc: Dave Hansen 
Cc: Borislav Petkov 
Cc: David Rientjes 
Cc: Igor Mammedov 
Cc: "H. Peter Anvin" 
Cc: Tim Chen 
Link:
https://lkml.kernel.org/r/20180407002130.ga18...@alison-desk.jf.intel.com

---
 arch/x86/kernel/smpboot.c | 45
-
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ff99e2b6fc54..45175b81dd5b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -77,6 +77,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
  /* Number of siblings per CPU package */
 int smp_num_siblings = 1;
@@ -390,15 +392,47 @@ static 

Re: [Xen-devel] [PATCH] x86/msr: further correct the emulation behaviour of MSR_PRED_CMD

2018-04-17 Thread Juergen Gross
On 17/04/18 13:41, Jan Beulich wrote:
> Following commit a6aa678fa3 ("x86/msr: Correct the emulation behaviour
> of MSR_PRED_CMD") we may end up writing the low bit with the wrong
> value. While it's unlikely for a guest to want to write zero there, we
> should still permit (this without incurring the overhead of an actual
> barrier). Correcting this right away will also help whenever further
> bits in the MSR might become defined.
> 
> Signed-off-by: Jan Beulich <jbeul...@suse.com>

Release-acked-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH for-4.11] libs/gnttab: fix FreeBSD gntdev interface

2018-04-17 Thread Juergen Gross
On 17/04/18 15:03, Roger Pau Monne wrote:
> Current interface to the gntdev in FreeBSD is wrong, and mostly worked
> out of luck before the PTI FreeBSD fixes, when kernel and user-space
> where sharing the same page tables.
> 
> On FreeBSD ioctls have the size of the passed struct encoded in the ioctl
> number, because the generic ioctl handler in the OS takes care of
> copying the data from user-space to kernel space, and then calls the
> device specific ioctl handler. Thus using ioctl structs with variable
> sizes is not possible.
> 
> The fix is to turn the array of structs at the end of
> ioctl_gntdev_alloc_gref and ioctl_gntdev_map_grant_ref into pointers,
> that can be properly accessed from the kernel gntdev driver using the
> copyin/copyout functions. Note that this is exactly how it's done for
> the privcmd driver.
> 
> Signed-off-by: Roger Pau Monné <roger@citrix.com>

Release-acked-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] docs/support-matrix-generate: use `git log' not `git-log'

2018-04-25 Thread Juergen Gross
On 25/04/18 17:56, Ian Jackson wrote:
> I found this bug when trying to set up the cron job.
> 
> Signed-off-by: Ian Jackson <ian.jack...@eu.citrix.com>

Release-acked-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH for-4.10 0/9] SUPPORT.md backports to support matrix generation [and 4 more messages]

2018-04-25 Thread Juergen Gross
On 25/04/18 17:58, Ian Jackson wrote:
> Ian Jackson writes ("Re: [PATCH for-4.10 0/9] SUPPORT.md backports to support 
> matrix generation [and 4 more messages]"):
>> Thanks everyone.  I have pushed this to staging-4.10, including the
>> patch 10/9.
> 
> The first cut of the cron job is now running.  So far it is only
> running off staging, and the output is here:
> 
>   http://xenbits.xen.org/docs/unstable-staging/support-matrix.html
> 
> It's labelled DRAFT because it's from staging, not because it's
> hand-edited or anything.
> 
> When we get pushes of all the relevant branches to the corresponding
> stable branches, I think the non-DRAFT version is ready to go.
> 
> Ian.
> 

The links to the 4.10 SUPPORT.html aren't working:

https://xenbits.xen.org/docs/4.10-testing/SUPPORT.html


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH-for-4.10 V2] adapt SUPPORT.md to match 4.11

2018-04-25 Thread Juergen Gross
Some tags have been changed in 4.11. Adapt the 4.10 ones to match in
order to produce an easier to read support HTML table.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 SUPPORT.md | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/SUPPORT.md b/SUPPORT.md
index f85dda8933..96002ea661 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -78,9 +78,9 @@ Fully virtualised guest using hardware virtualisation 
extensions
 
 Requires hardware virtualisation support (Intel VMX / AMD SVM)
 
-Status: Supported
+Status, domU: Supported
 
-### x86/PVH guest
+### x86/PVH
 
 PVH is a next-generation paravirtualized mode
 designed to take advantage of hardware virtualization support when possible.
@@ -88,9 +88,9 @@ During development this was sometimes called HVMLite or PVHv2.
 
 Requires hardware virtualisation support (Intel VMX / AMD SVM)
 
-Status: Supported
+Status, domU: Supported
 
-### ARM guest
+### ARM
 
 ARM only has one guest type at the moment
 
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v4 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context

2018-03-27 Thread Juergen Gross
When switching to a 64-bit pv context the TLB is flushed twice today:
the first time when switching to the new address space in
write_ptbase(), the second time when switching to guest mode in
restore_to_guest.

Avoid the first TLB flush in that case.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- omit setting root_pgt_changed to false (Jan Beulich)
---
 xen/arch/x86/mm.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b1d3f0eda8..f7d24a1f8b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -123,6 +123,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -503,9 +504,13 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-if ( this_cpu(root_pgt) )
+if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+{
 get_cpu_info()->root_pgt_changed = true;
-write_cr3(v->arch.cr3);
+asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+}
+else
+write_cr3(v->arch.cr3);
 }
 
 /*
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v4 0/7] xen/x86: various XPTI speedups

2018-03-27 Thread Juergen Gross
This patch series aims at reducing the overhead of the XPTI Meltdown
mitigation. It is based on Jan's XPTI speedup series.

Patch 1 had been posted before, the main changes in this patch are due
to addressing Jan's comments on my first version. The main objective of
that patch is to avoid copying the L4 page table each time the guest is
being activated, as often the contents didn't change while the
hypervisor was active.

Patch 2 tries to minimize flushing the TLB: there is no need to flush
it in write_ptbase() and when activating the guest.

Patch 3 sets the stage for being able to activate XPTI per domain. As a
first step it is now possible to switch XPTI off for dom0 via the xpti
boot parameter.

Patch 4 adds support for using the INVPCID instruction for flushing
the TLB.

Patch 5 reduces the costs of TLB flushes even further: as we don't make
any use of global TLB entries with XPTI being active we can avoid
removing all global TLB entries on TLB flushes by simply deactivating
the global pages in CR4.

Patch 6 was originally only meant to prepare using PCIDs in patch 6.
For that purpose it was necessary to allow CR3 values with bit 63 set
in order to avoid flushing TLB entries when writing CR3. This requires
a modification of Jan's rather clever state machine with positive and
negative CR3 values for the hypervisor by using a dedicated flag byte
instead. It turned out this modification saved one branch on interrupt
entry speeding up the handling by a few percent.

Patch 7 is the main performance contributor: by making use of the PCID
feature (if available) TLB entries can survive CR3 switches. The TLB
needs to be flushed on context switches only and not when switching
between guest and hypervisor or guest kernel and user mode.

On my machine (Intel i7-4600M) using the PCID feature in the non-XPTI
case showed a slightly worse performance than using global pages
instead (using PCID and global pages is a bad idea as invalidating
global pages in this case would need a complete TLB flush). For this
reason I've decided to use PCID for XPTI only as the default. That
can easily be changed by using the command line parameter "pcid=all".

The complete series has been verified to still mitigate against
Meltdown attacks. A simple performance test (make -j 4 in the Xen
hypervisor directory) showed significant improvements compared to the
state without this series (so with Jan's series applied),
the percentage after the numbers is always related to XPTI off:

   XPTI off Jan, XPTI on+this series, XPTI on
real   1m21.169s1m52.149s (+38%)1m25.692s (+6%)
user   2m47.652s2m50.054s (+1%) 2m46.428s (-1%)
sys1m11.949s2m21.767s (+97%)1m23.053s (+15%)


Juergen Gross (7):
  x86/xpti: avoid copying L4 page table contents when possible
  x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  xen/x86: support per-domain flag for xpti
  xen/x86: use invpcid for flushing the TLB
  xen/x86: disable global pages for domains with XPTI active
  xen/x86: use flag byte for decision whether xen_cr3 is valid
  xen/x86: use PCID feature

 docs/misc/xen-command-line.markdown |  32 +++-
 xen/arch/x86/cpu/mtrr/generic.c |  37 ++---
 xen/arch/x86/debug.c|   2 +-
 xen/arch/x86/domain.c   |   6 +-
 xen/arch/x86/domain_page.c  |   2 +-
 xen/arch/x86/domctl.c   |   8 ++
 xen/arch/x86/flushtlb.c |  91 -
 xen/arch/x86/mm.c   |  84 
 xen/arch/x86/pv/dom0_build.c|   4 +
 xen/arch/x86/pv/domain.c| 154 +++-
 xen/arch/x86/setup.c|  27 +++
 xen/arch/x86/smp.c  |   2 +-
 xen/arch/x86/smpboot.c  |   6 +-
 xen/arch/x86/x86_64/asm-offsets.c   |   2 +
 xen/arch/x86/x86_64/compat/entry.S  |   5 +-
 xen/arch/x86/x86_64/entry.S |  87 +---
 xen/common/efi/runtime.c|   4 +-
 xen/include/asm-x86/current.h   |  23 --
 xen/include/asm-x86/domain.h|  17 ++--
 xen/include/asm-x86/flushtlb.h  |   4 +-
 xen/include/asm-x86/processor.h |   3 +
 xen/include/asm-x86/pv/domain.h |  24 ++
 xen/include/asm-x86/x86-defns.h |   4 +-
 23 files changed, 483 insertions(+), 145 deletions(-)

-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v4 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-27 Thread Juergen Gross
Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.

In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to write_cr3_cr4() called by write_ptbase()).

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V4:
- don't use mmu_cr4_features for setting new cr4 value (Jan Beulich)
- use simpler scheme for setting X86_CR4_PGE in
  pv_guest_cr4_to_real_cr4() (Jan Beulich)

V3:
- move cr4 loading for all domains from *_ctxt_switch_to() to
  write_cr3_cr4() called by write_ptbase() (Jan Beulich)
- rebase
---
 xen/arch/x86/domain.c  |  5 -
 xen/arch/x86/flushtlb.c| 13 -
 xen/arch/x86/mm.c  | 14 +++---
 xen/arch/x86/x86_64/entry.S| 10 --
 xen/common/efi/runtime.c   |  4 ++--
 xen/include/asm-x86/domain.h   |  3 ++-
 xen/include/asm-x86/flushtlb.h |  2 +-
 7 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index fbb320da9c..38c61dc13a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1522,17 +1522,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
 void paravirt_ctxt_switch_to(struct vcpu *v)
 {
 root_pgentry_t *root_pgt = this_cpu(root_pgt);
-unsigned long cr4;
 
 if ( root_pgt )
 root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
 l4e_from_page(v->domain->arch.perdomain_l3_pg,
   __PAGE_HYPERVISOR_RW);
 
-cr4 = pv_guest_cr4_to_real_cr4(v);
-if ( unlikely(cr4 != read_cr4()) )
-write_cr4(cr4);
-
 if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
 activate_debugregs(v);
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index e740520a8b..5cb0ad97b8 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -91,20 +91,23 @@ static void do_tlb_flush(void)
 post_flush(t);
 }
 
-void write_cr3(unsigned long cr3)
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
 {
-unsigned long flags, cr4;
+unsigned long flags;
 u32 t;
 
 /* This non-reentrant function is sometimes called in interrupt context. */
 local_irq_save(flags);
 
 t = pre_flush();
-cr4 = read_cr4();
 
-write_cr4(cr4 & ~X86_CR4_PGE);
+if ( read_cr4() & X86_CR4_PGE )
+write_cr4(cr4 & ~X86_CR4_PGE);
+
 asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
-write_cr4(cr4);
+
+if ( read_cr4() != cr4 )
+write_cr4(cr4);
 
 post_flush(t);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 008dcc1749..4f637a3c3c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,20 +505,28 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 void write_ptbase(struct vcpu *v)
 {
 struct cpu_info *cpu_info = get_cpu_info();
+unsigned long new_cr4;
+
+new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
+  ? pv_guest_cr4_to_real_cr4(v)
+  : ((read_cr4() & ~X86_CR4_TSD) | X86_CR4_PGE);
 
 if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
 {
 cpu_info->root_pgt_changed = true;
 cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
-asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+write_cr3_cr4(v->arch.cr3, new_cr4);
 }
 else
 {
-/* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+/* Make sure to clear xen_cr3 before pv_cr3. */
 cpu_info->xen_cr3 = 0;
-write_cr3(v->arch.cr3);
+/* write_cr3_cr4() serializes. */
+write_cr3_cr4(v->arch.cr3, new_cr4);
 cpu_info->pv_cr3 = 0;
 }
+
+ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 2a06cd1a51..eb33ec835a 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -154,13 +154,8 @@ restore_all_guest:
 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
 rep movsq
 .Lrag_copy_done:
-mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
 mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
-mov   %rdi, %rsi
-and   $~X86_CR4_PGE, %rdi
-mov   %rdi, %cr4
 mov   %rax, %cr3
-mov   %rsi, %cr4
 .Lrag_cr3_end:
 ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
 
@@ -218,12 +213,7 @@ restore_all_xen:
  * so "g" will have to do.
  */
 UNLIKELY_START(g, exit_cr3)
-mov   %cr4, %rdi
-m

[Xen-devel] [PATCH v4 3/7] xen/x86: support per-domain flag for xpti

2018-03-27 Thread Juergen Gross
Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows to modify the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=nodom0" as boot
parameter will achieve that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.

Signed-off-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Jan Beulich <jbeul...@suse.com>
---
V3:
- latch get_cpu_info() return value in variable (Jan Beulich)
- call always xpti_domain_init() for pv dom0 (Jan Beulich)
- add __init annotations (Jan Beulich)
- drop per domain XPTI message (Jan Beulich)
- document xpti=default support (Jan Beulich)
- move domain xpti flag into a padding hole (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 10 -
 xen/arch/x86/domctl.c   |  4 ++
 xen/arch/x86/mm.c   | 12 +-
 xen/arch/x86/pv/dom0_build.c|  3 ++
 xen/arch/x86/pv/domain.c| 79 -
 xen/arch/x86/setup.c| 20 +-
 xen/arch/x86/smpboot.c  |  4 +-
 xen/arch/x86/x86_64/entry.S |  2 +
 xen/include/asm-x86/current.h   |  3 +-
 xen/include/asm-x86/domain.h|  3 ++
 xen/include/asm-x86/pv/domain.h |  4 ++
 11 files changed, 119 insertions(+), 25 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index b353352adf..79be9a6ba5 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1955,7 +1955,7 @@ clustered mode.  The default, given no hint from the 
**FADT**, is cluster
 mode.
 
 ### xpti
-> `= `
+> `= default | nodom0 | `
 
 > Default: `false` on AMD hardware
 > Default: `true` everywhere else
@@ -1963,6 +1963,14 @@ mode.
 Override default selection of whether to isolate 64-bit PV guest page
 tables.
 
+`true` activates page table isolation even on AMD hardware.
+
+`false` deactivates page table isolation on all systems.
+
+`default` sets the default behaviour.
+
+`nodom0` deactivates page table isolation for dom0.
+
 ### xsave
 > `= `
 
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 8fbbf3aeb3..0704f398c7 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include  /* for hvm_acpi_power_button */
 #include  /* for arch_do_domctl */
 #include 
@@ -610,6 +611,9 @@ long arch_do_domctl(
 ret = switch_compat(d);
 else
 ret = -EINVAL;
+
+if ( ret == 0 )
+xpti_domain_init(d);
 break;
 
 case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f7d24a1f8b..008dcc1749 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -504,13 +504,21 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+struct cpu_info *cpu_info = get_cpu_info();
+
+if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
 {
-get_cpu_info()->root_pgt_changed = true;
+cpu_info->root_pgt_changed = true;
+cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
 asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
 }
 else
+{
+/* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+cpu_info->xen_cr3 = 0;
 write_cr3(v->arch.cr3);
+cpu_info->pv_cr3 = 0;
+}
 }
 
 /*
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 0bd2f1bf90..77186c19bd 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -707,6 +708,8 @@ int __init dom0_construct_pv(struct domain *d,
 cpu = p->processor;
 }
 
+xpti_domain_init(d);
+
 d->arch.paging.mode = 0;
 
 /* Set up CR3 value for write_ptbase */
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index fd529332a3..e6bb2bac76 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -9,6 +9,8 @@
 #include 
 #include 
 
+#include 
+#include 
 #include 
 
 /* Override macros from asm/page.h to make them work with mfn_t */
@@ -17,6 +19,81 @@
 #undef page_to_mfn
 #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
 
+static __read_mostly enum {
+XPTI_DEFAULT,
+XPTI_ON,
+XPTI_OFF,
+XPTI_NODOM0
+} opt_xpti = XPTI_DEFAULT;
+
+static __init int parse_xpti(const char *s)
+{
+int rc = 0;
+
+switch ( parse_bool(s, NULL) )
+{
+case 0:
+opt_xpti = XPTI_OFF;
+break;
+case 1:
+opt_xpti = XPTI_ON;
+break;
+default:
+if ( !strcmp(s, "default") )
+opt_xpti = XPTI_DEFAULT;

[Xen-devel] [PATCH v4 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid

2018-03-27 Thread Juergen Gross
Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.

Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).

Signed-off-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Jan Beulich <jbeul...@suse.com>
---
V3:
- renamed use_xen_cr3 to better fitting use_pv_cr3
- corrected comment regarding semantics of use_pv_cr3 (Jan Beulich)
- prefer 32-bit operations over 8- or 16-bit ones (Jan Beulich)
---
 xen/arch/x86/domain.c  |  1 +
 xen/arch/x86/mm.c  |  3 +-
 xen/arch/x86/smpboot.c |  2 ++
 xen/arch/x86/x86_64/asm-offsets.c  |  1 +
 xen/arch/x86/x86_64/compat/entry.S |  5 ++-
 xen/arch/x86/x86_64/entry.S| 67 +-
 xen/include/asm-x86/current.h  | 12 ---
 7 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 38c61dc13a..57ff40bad8 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1695,6 +1695,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
 ASSERT(local_irq_is_enabled());
 
+get_cpu_info()->use_pv_cr3 = false;
 get_cpu_info()->xen_cr3 = 0;
 
 if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 4f637a3c3c..856eb9e67f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -519,7 +519,8 @@ void write_ptbase(struct vcpu *v)
 }
 else
 {
-/* Make sure to clear xen_cr3 before pv_cr3. */
+/* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */
+cpu_info->use_pv_cr3 = false;
 cpu_info->xen_cr3 = 0;
 /* write_cr3_cr4() serializes. */
 write_cr3_cr4(v->arch.cr3, new_cr4);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 346a8e8a3f..05109a98fa 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -330,6 +330,7 @@ void start_secondary(void *unused)
  */
 spin_debug_disable();
 
+get_cpu_info()->use_pv_cr3 = false;
 get_cpu_info()->xen_cr3 = 0;
 get_cpu_info()->pv_cr3 = 0;
 
@@ -1129,6 +1130,7 @@ void __init smp_prepare_boot_cpu(void)
 per_cpu(scratch_cpumask, cpu) = _cpu0mask;
 #endif
 
+get_cpu_info()->use_pv_cr3 = false;
 get_cpu_info()->xen_cr3 = 0;
 get_cpu_info()->pv_cr3 = 0;
 }
diff --git a/xen/arch/x86/x86_64/asm-offsets.c 
b/xen/arch/x86/x86_64/asm-offsets.c
index 9e2aefb00f..7ad024cf37 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -144,6 +144,7 @@ void __dummy__(void)
 OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, 
use_shadow_spec_ctrl);
 OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
 OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
+OFFSET(CPUINFO_use_pv_cr3, struct cpu_info, use_pv_cr3);
 DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
 BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S 
b/xen/arch/x86/x86_64/compat/entry.S
index 6d2a14eacf..a18598f106 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -211,10 +211,9 @@ ENTRY(cstar_enter)
 GET_STACK_END(bx)
 .Lcstar_cr3_start:
 mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-neg   %rcx
+test  %rcx, %rcx
 jz.Lcstar_cr3_okay
-mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-neg   %rcx
+movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
 mov   %rcx, %cr3
 /* %r12 is still zero at this point. */
 mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index eb33ec835a..f51d091c23 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -155,6 +155,7 @@ restore_all_guest:
 rep movsq
 .Lrag_copy_done:
 mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+movb  $1, STACK_CPUINFO_FIELD(use_pv_cr3)(%rdx)
 mov   %rax, %cr3
 .Lrag_cr3_end:
 ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
@@ -205,14 +206,9 @@ restore_all_xen:
  */
 GET_STACK_END(bx)
 .Lrax_cr3_start:
-mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+cmpb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
+UNLIKELY_START(ne, exit_cr3)
 mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
-test  %rdx, %rdx
-/*
- * Ideally the condition would be "nsz

[Xen-devel] [PATCH v4 1/7] x86/xpti: avoid copying L4 page table contents when possible

2018-03-27 Thread Juergen Gross
For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag whether the copying should be performed and set
that flag only when loading a new %cr3 or modifying the L4 page table.
This includes synchronization of the cpu local root page table with
other cpus, so add a special synchronization flag for that case.

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V4:
- move setting of root_pgt_changed flag in flush_area_local() out of
  irq disabled section (Jan Beulich)
- move setting of root_pgt_changed in make_cr3() to _toggle_guest_pt()
  (Jan Beulich)
- remove most conditionals in write_ptbase() (Jan Beulich)
- don't set root_pgt_changed in do_mmu_update() for modification of
  the user page table (Jan Beulich)

V3:
- set flag locally only if affected L4 is active (Jan Beulich)
- add setting flag to flush_area_mask() (Jan Beulich)
- set flag in make_cr3() only if called for current active vcpu

To be applied on top of Jan's "Meltdown band-aid overhead reduction"
series
---
 xen/arch/x86/flushtlb.c   |  3 +++
 xen/arch/x86/mm.c | 38 +-
 xen/arch/x86/pv/domain.c  |  1 +
 xen/arch/x86/smp.c|  2 +-
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/arch/x86/x86_64/entry.S   |  8 ++--
 xen/include/asm-x86/current.h |  8 
 xen/include/asm-x86/flushtlb.h|  2 ++
 8 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 8a7a76b8ff..38cedf3b22 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -160,5 +160,8 @@ unsigned int flush_area_local(const void *va, unsigned int 
flags)
 
 local_irq_restore(irqfl);
 
+if ( flags & FLUSH_ROOT_PGTBL )
+get_cpu_info()->root_pgt_changed = true;
+
 return flags;
 }
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f0195561c2..b1d3f0eda8 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -503,6 +503,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
+if ( this_cpu(root_pgt) )
+get_cpu_info()->root_pgt_changed = true;
 write_cr3(v->arch.cr3);
 }
 
@@ -3698,18 +3700,28 @@ long do_mmu_update(
 break;
 rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
   cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-/*
- * No need to sync if all uses of the page can be accounted
- * to the page lock we hold, its pinned status, and uses on
- * this (v)CPU.
- */
-if ( !rc && !cpu_has_no_xpti &&
- ((page->u.inuse.type_info & PGT_count_mask) >
-  (1 + !!(page->u.inuse.type_info & PGT_pinned) +
-   (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
-   (pagetable_get_pfn(curr->arch.guest_table_user) ==
-mfn))) )
-sync_guest = true;
+if ( !rc && !cpu_has_no_xpti )
+{
+bool local_in_use = false;
+
+if ( pagetable_get_pfn(curr->arch.guest_table) == mfn )
+{
+local_in_use = true;
+get_cpu_info()->root_pgt_changed = true;
+}
+
+/*
+ * No need to sync if all uses of the page can be
+ * accounted to the page lock we hold, its pinned
+ * status, and uses on this (v)CPU.
+ */
+if ( (page->u.inuse.type_info & PGT_count_mask) >
+ (1 + !!(page->u.inuse.type_info & PGT_pinned) +
+  (pagetable_get_pfn(curr->arch.guest_table_user) 
==
+   mfn) +
+  local_in_use) )
+sync_guest = true;
+}
 break;
 
 case PGT_writable_page:
@@ -3824,7 +3836,7 @@ long do_mmu_update(
 
 cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
 if ( !cp

[Xen-devel] [PATCH v4 7/7] xen/x86: use PCID feature

2018-03-27 Thread Juergen Gross
Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.

With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.

pv_guest_cr4_to_real_cr4() is switched from a macro to a real function
now as it has become more complex.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V4:
- add cr3 mask for page table address and use that in dbg_pv_va2mfn()
  (Jan Beulich)
- use invpcid_flush_all_nonglobals() instead of invpcid_flush_all()
  (Jan Beulich)
- use PCIDs 0/1 when running in Xen or without XPTI, 2/3 with XPTI in
  guest (Jan Beulich)
- ASSERT cr4.pge and cr4.pcide are never active at the same time
  (Jan Beulich)
- make pv_guest_cr4_to_real_cr4() a real function

V3:
- support PCID for non-XPTI case, too
- add command line parameter for controlling usage of PCID
- check PCID active by using cr4.pcide (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 12 ++
 xen/arch/x86/debug.c|  2 +-
 xen/arch/x86/domain_page.c  |  2 +-
 xen/arch/x86/domctl.c   |  4 ++
 xen/arch/x86/flushtlb.c | 44 +++--
 xen/arch/x86/mm.c   | 24 +++-
 xen/arch/x86/pv/dom0_build.c|  1 +
 xen/arch/x86/pv/domain.c| 76 -
 xen/include/asm-x86/domain.h| 15 +++-
 xen/include/asm-x86/processor.h |  3 ++
 xen/include/asm-x86/pv/domain.h | 20 ++
 xen/include/asm-x86/x86-defns.h |  4 +-
 12 files changed, 188 insertions(+), 19 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index 5f6ae654ad..db87fd326d 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1452,6 +1452,18 @@ All numbers specified must be hexadecimal ones.
 
 This option can be specified more than once (up to 8 times at present).
 
+### pcid (x86)
+> `=  | xpti | noxpti`
+
+> Default: `xpti`
+
+> Can be modified at runtime
+
+If available, control usage of the PCID feature of the processor for
+64-bit pv-domains. PCID can be used either for no domain at all (`false`),
+for all of them (`true`), only for those subject to XPTI (`xpti`) or for
+those not subject to XPTI (`noxpti`).
+
 ### ple\_gap
 > `= `
 
diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index 9159f32db4..0d46f2f45a 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -97,7 +97,7 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t 
pgd3val)
 l3_pgentry_t l3e, *l3t;
 l2_pgentry_t l2e, *l2t;
 l1_pgentry_t l1e, *l1t;
-unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
+unsigned long cr3 = pgd3val ?: (dp->vcpu[0]->arch.cr3 & X86_CR3_ADDR_MASK);
 mfn_t mfn = maddr_to_mfn(cr3);
 
 DBGP2("vaddr:%lx domid:%d cr3:%lx pgd3:%lx\n", vaddr, dp->domain_id, 
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index b5780f201f..bf5bf45188 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
 if ( (v = idle_vcpu[smp_processor_id()]) == current )
 sync_local_execstate();
 /* We must now be running on the idle page table. */
-ASSERT(read_cr3() == __pa(idle_pg_table));
+ASSERT((read_cr3() & ~X86_CR3_PCID_MASK) == __pa(idle_pg_table));
 }
 
 return v;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 0704f398c7..a7c8772fa6 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -613,7 +613,11 @@ long arch_do_domctl(
 ret = -EINVAL;
 
 if ( ret == 0 )
+{
 xpti_domain_init(d);
+pcid_domain_init(d);
+}
+
 break;
 
 case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 5cb0ad97b8..96f09239f0 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
 #ifdef NDEBUG
@@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned l

[Xen-devel] [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB

2018-03-27 Thread Juergen Gross
If possible use the INVPCID instruction for flushing the TLB instead of
toggling cr4.pge for that purpose.

While at it remove the dependency on cr4.pge being required for mtrr
loading, as this will be required later anyway.

Add a command line option "invpcid" for controlling the use of
INVPCID (default to true).

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V4:
- option "invpcid" instead of "noinvpcid" (Jan Beulich)

V3:
- new patch
---
 docs/misc/xen-command-line.markdown | 10 ++
 xen/arch/x86/cpu/mtrr/generic.c | 37 ++---
 xen/arch/x86/flushtlb.c | 31 +--
 xen/arch/x86/setup.c|  7 +++
 4 files changed, 64 insertions(+), 21 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index 79be9a6ba5..5f6ae654ad 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1380,6 +1380,16 @@ Because responsibility for APIC setup is shared between 
Xen and the
 domain 0 kernel this option is automatically propagated to the domain
 0 command line.
 
+### invpcid (x86)
+> `= `
+
+> Default: `true`
+
+Control using the INVPCID instruction for flushing TLB entries.
+This should only be used in case of known issues on the current platform
+with that instruction. Disabling INVPCID will normally result in a slightly
+degraded performance.
+
 ### noirqbalance
 > `= `
 
diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
index e9c0e5e059..e88643f4bf 100644
--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -400,8 +401,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
  * has been called.
  */
 
-static void prepare_set(void)
+static bool prepare_set(void)
 {
+   unsigned long cr4;
+
/*  Note that this is not ideal, since the cache is only 
flushed/disabled
   for this CPU while the MTRRs are changed, but changing this requires
   more invasive changes to the way the kernel boots  */
@@ -412,18 +415,24 @@ static void prepare_set(void)
write_cr0(read_cr0() | X86_CR0_CD);
wbinvd();
 
-   /*  TLB flushing here relies on Xen always using CR4.PGE. */
-   BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
-   write_cr4(read_cr4() & ~X86_CR4_PGE);
+   cr4 = read_cr4();
+   if (cr4 & X86_CR4_PGE)
+   write_cr4(cr4 & ~X86_CR4_PGE);
+   else if (cpu_has_invpcid)
+   invpcid_flush_all();
+   else
+   asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
/*  Save MTRR state */
rdmsrl(MSR_MTRRdefType, deftype);
 
/*  Disable MTRRs, and set the default type to uncached  */
mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
+
+   return cr4 & X86_CR4_PGE;
 }
 
-static void post_set(void)
+static void post_set(bool pge)
 {
/* Intel (P6) standard MTRRs */
mtrr_wrmsr(MSR_MTRRdefType, deftype);
@@ -432,7 +441,12 @@ static void post_set(void)
write_cr0(read_cr0() & ~X86_CR0_CD);
 
/*  Reenable CR4.PGE (also flushes the TLB) */
-   write_cr4(read_cr4() | X86_CR4_PGE);
+   if (pge)
+   write_cr4(read_cr4() | X86_CR4_PGE);
+   else if (cpu_has_invpcid)
+   invpcid_flush_all();
+   else
+   asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
spin_unlock(_atomicity_lock);
 }
@@ -441,14 +455,15 @@ static void generic_set_all(void)
 {
unsigned long mask, count;
unsigned long flags;
+   bool pge;
 
local_irq_save(flags);
-   prepare_set();
+   pge = prepare_set();
 
/* Actually set the state */
mask = set_mtrr_state();
 
-   post_set();
+   post_set(pge);
local_irq_restore(flags);
 
/*  Use the atomic bitops to update the global mask  */
@@ -457,7 +472,6 @@ static void generic_set_all(void)
set_bit(count, _changes_mask);
mask >>= 1;
}
-   
 }
 
 static void generic_set_mtrr(unsigned int reg, unsigned long base,
@@ -474,11 +488,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned 
long base,
 {
unsigned long flags;
struct mtrr_var_range *vr;
+   bool pge;
 
vr = _state.var_ranges[reg];
 
local_irq_save(flags);
-   prepare_set();
+   pge = prepare_set();
 
if (size == 0) {
/* The invalid bit is kept in the mask, so we simply clear the
@@ -499,7 +514,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned 
long base,
mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
}
 
-   post_set();

Re: [Xen-devel] [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-27 Thread Juergen Gross
On 22/03/18 17:30, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
> 
> I continue to be not entirely convinced of this move. I had an
> alternative in mind: Since retaining global pages is particularly
> relevant for switches between guest user and guest kernel
> modes, what if we made a shortcut from e.g. lstar_enter through
> switch_to_kernel to restore_all_guest without ever switching to
> the full page Xen tables?
> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>  struct cpu_info *cpu_info = get_cpu_info();
>> +unsigned long new_cr4;
>> +
>> +new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>> +  ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
> 
> I'm not overly happy to see any new uses of mmu_cr4_features.
> This should really only be used for priming certain values imo,
> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
> so too, and perhaps better wouldn't). Hence I wonder whether
> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
> because we've just got rid of the blanket reversion to
> mmu_cr4_features in VMX code.

I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
keep bits switched on which a pv domain is allowed to modify (plus
CR4_TSD eventually).

Do we really want that?

We could mask away certain bits, of course, but in the end we'd just
have a default calculated cr4 value instead of having it just set
somewhere initially.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-27 Thread Juergen Gross
On 27/03/18 09:23, Jan Beulich wrote:
 On 27.03.18 at 09:14,  wrote:
>> On 22/03/18 17:30, Jan Beulich wrote:
>> On 21.03.18 at 13:51,  wrote:
 Instead of flushing the TLB from global pages when switching address
 spaces with XPTI being active just disable global pages via %cr4
 completely when a domain subject to XPTI is active. This avoids the
 need for extra TLB flushes as loading %cr3 will remove all TLB
 entries.
>>>
>>> I continue to be not entirely convinced of this move. I had an
>>> alternative in mind: Since retaining global pages is particularly
>>> relevant for switches between guest user and guest kernel
>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>> switch_to_kernel to restore_all_guest without ever switching to
>>> the full page Xen tables?
>>>
 --- a/xen/arch/x86/mm.c
 +++ b/xen/arch/x86/mm.c
 @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
  void write_ptbase(struct vcpu *v)
  {
  struct cpu_info *cpu_info = get_cpu_info();
 +unsigned long new_cr4;
 +
 +new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
 +  ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>
>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>> This should really only be used for priming certain values imo,
>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>> because we've just got rid of the blanket reversion to
>>> mmu_cr4_features in VMX code.
>>
>> I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
>> keep bits switched on which a pv domain is allowed to modify (plus
>> CR4_TSD eventually).
>>
>> Do we really want that?
> 
> Does it matter what exact CR4 settings we run with when it's not
> a PV guest that's in context, and when we don't depend on the
> settings ourselves? I don't think it does, and HVM guests run with
> their own CR4 anyway. In fact there may end up being cases
> where we won't need to switch CR4 another time when we come
> here the next time with v being a PV vCPU.

I could imagine that there is some performance impact. cr4.tsd set
might make rdtsc a little bit slower as an additional privilege level
check is needed.

Suspending requires cr4.fsgsbase to be set, which might have been
reset by a pv guest.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-27 Thread Juergen Gross
On 27/03/18 10:33, Jan Beulich wrote:
 On 27.03.18 at 09:37,  wrote:
>> On 27/03/18 09:23, Jan Beulich wrote:
>> On 27.03.18 at 09:14,  wrote:
 On 22/03/18 17:30, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
>
> I continue to be not entirely convinced of this move. I had an
> alternative in mind: Since retaining global pages is particularly
> relevant for switches between guest user and guest kernel
> modes, what if we made a shortcut from e.g. lstar_enter through
> switch_to_kernel to restore_all_guest without ever switching to
> the full page Xen tables?
>
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>  struct cpu_info *cpu_info = get_cpu_info();
>> +unsigned long new_cr4;
>> +
>> +new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>> +  ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>
> I'm not overly happy to see any new uses of mmu_cr4_features.
> This should really only be used for priming certain values imo,
> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
> so too, and perhaps better wouldn't). Hence I wonder whether
> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
> because we've just got rid of the blanket reversion to
> mmu_cr4_features in VMX code.

 I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
 keep bits switched on which a pv domain is allowed to modify (plus
 CR4_TSD eventually).

 Do we really want that?
>>>
>>> Does it matter what exact CR4 settings we run with when it's not
>>> a PV guest that's in context, and when we don't depend on the
>>> settings ourselves? I don't think it does, and HVM guests run with
>>> their own CR4 anyway. In fact there may end up being cases
>>> where we won't need to switch CR4 another time when we come
>>> here the next time with v being a PV vCPU.
>>
>> I could imagine that there is some performance impact. cr4.tsd set
>> might make rdtsc a little bit slower as an additional privilege level
>> check is needed.
> 
> Quite possible, indeed. Another opinion on the route to take
> would be helpful here. Andrew?

I could mask away tsd, of course. I need to do so for pcide already,
so that would be just another bit reset in the mask.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH] correct maintainers file

2018-03-28 Thread Juergen Gross
Correct wrong entry in MAINTAINERS file.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index eace09ed22..bb049c8664 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -416,7 +416,7 @@ F:  xen/arch/*/vm_event.c
 F: xen/arch/arm/mem_access.c
 F: xen/arch/x86/mm/mem_access.c
 F: xen/arch/x86/hvm/monitor.c
-F: xen/arch/x88/hvm/vm_event.c
+F: xen/arch/x86/hvm/vm_event.c
 F: xen/common/mem_access.c
 F: xen/common/monitor.c
 F: xen/common/vm_event.c
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] xen/acpi: off by one in read_acpi_id()

2018-03-28 Thread Juergen Gross
On 28/03/18 13:47, Dan Carpenter wrote:
> If acpi_id is == nr_acpi_bits, then we access one element beyond the end
> of the acpi_psd[] array or we set one bit beyond the end of the bit map
> when we do __set_bit(acpi_id, acpi_id_cst_present);
> 
> Fixes: 59a568029181 ("xen/acpi-processor: C and P-state driver that uploads 
> said data to hypervisor.")
> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>
> 
> diff --git a/drivers/xen/xen-acpi-processor.c 
> b/drivers/xen/xen-acpi-processor.c
> index c80195e8fbd1..d23c9c150199 100644
> --- a/drivers/xen/xen-acpi-processor.c
> +++ b/drivers/xen/xen-acpi-processor.c
> @@ -364,7 +364,7 @@ read_acpi_id(acpi_handle handle, u32 lvl, void *context, 
> void **rv)
>   }
>   /* There are more ACPI Processor objects than in x2APIC or MADT.
>* This can happen with incorrect ACPI SSDT declerations. */
> - if (acpi_id > nr_acpi_bits) {
> + if (acpi_id >= nr_acpi_bits) {
>   pr_debug("We only have %u, trying to set %u\n",
>nr_acpi_bits, acpi_id);

Can you please modify this message, too? E.g. something like:

pr_debug("max acpi id %u, trying to set %u\n",
 nr_acpi_bits - 1, acpi_id);

With that:

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-26 Thread Juergen Gross
On 26/03/18 12:13, Jan Beulich wrote:
 On 26.03.18 at 10:55,  wrote:
>> On 26/03/18 10:28, Jan Beulich wrote:
>> On 26.03.18 at 08:49,  wrote:
 On 23/03/18 16:58, Jan Beulich wrote:
 On 23.03.18 at 15:11,  wrote:
>> On 23/03/18 14:46, Jan Beulich wrote:
>>> So in the end the question is: Why not use just two PCIDs, and
>>> allow global pages just like we do now, with the added benefit
>>> that we no longer need to flush Xen's global TLB entries just
>>> because we want to get rid of PV guest user ones.
>>
>> I can't see how that would work without either needing some more TLB
>> flushes in order to prevent stale TLB entries or loosing the Meltdown
>> mitigation.
>>
>> Which %cr3/PCID combination should be used in hypervisor, guest kernel
>> and guest user mode?
>
> Xen would run with PCID 0 (and full Xen mappings) at all times
> (except early entry and late exit code of course). The guest would
> run with PCID 1 (and minimal Xen mappings) at all times. The switch
> of PCID eliminates the need for flushes on the way out and back in.

 You still need the kernel page tables flushed when switching to user
 mode, right?
>>>
>>> Of course.
>>>
>> Which pages would be global?
>
> Use of global pages would continue to be as today: Xen has some,
> and guest user mode has some. Of course it is quite possible that
> the use of global pages with a single guest PCID is still worse than
> no global pages with two guest PCIDs, but that's a separate step
> to take (and measure) imo.

 But global pages of Xen would either make it vulnerable with regard to
 Meltdown or you need a TLB flush again when switching between Xen and
 guest making all the PCID stuff moot.
>>>
>>> No - the guest would run with PCID 1 active, and global Xen TLB
>>> entries would exist for PCID 0 only.
>>
>> Uuh, global pages are accessible via all PCIDs. That's why they are
>> called global...
> 
> Okay, in that case all of what I've said in this regard was rubbish.
> (I don't, btw, think that this is the only sensible interpretation of
> "global" - it could as well mean protected from ordinary flushes
> within the given PCID realm.)

That's the reason I gave the reference to the SDM. It clearly states
that TLB entries with the global bit set don't have to match the current
PCID for being regarded to match.

> 
 - 2 PCIDs
 - no TLB flushes needed when switching between Xen and guest
 - when switching from guest kernel to guest user the kernel pages must
   be flushed from TLB
 - flushing of single guest user pages needs 2 changes of %cr3 and 2
   INVLPGs, switch code must be mapped to guest page tables
 - flushing of complete TLB via 1 INVPCID

 So the advantage of the 2 PCID solution are the single TLB entries for
 guest user pages compared to 2 entries for guest user pages accessed by
 the guest kernel or Xen.

 The disadvantage are the flushed guest kernel pages when executing user
 code, the more complicated single user page flushing and the dynamical
 Xen global bit handling.
>>>
>>> Right. In order to make forward progress here I think we should
>>> shelve the discussion on the 2-PCID alternative for now. What I'd
>>> like to ask for as a change to your current approach is to use
>>> PCID 0 for Xen rather than running Xen with PCIDs 2 or 3 when
>>> PCIDs are enabled, and (implicitly) with PCID 0 when they're
>>> disabled. Or alternatively don't use PCID 0 at all when PCIDs are
>>> enabled. I'm simply worried of us overlooking a case where PCID
>>> 0 TLB entries may be left in place (when switching between PCIDs
>>> enabled and PCIDs disabled) when they should have been flushed,
>>> opening back up a Meltdown-like attack window.
>> The reason I didn't use PCID 0 for running Xen was to use a few
>> INVPCID calls as possible for single page invalidation and still
>> covering the cases for PCID on while XPTI off and including PCID 0.
> 
> How would the number of INVPCIDs needed differ depending on
> the actual PCID values used?

See answer below.

>> I can change the scheme to use different values for guest PCIDs
>> with XPTI on, of course. Are you fine with:
>>
>> - XPTI off: PCID 0 = kernel, PCID 1 = user
>> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>> PCID 2 = kernel/guest, PCID 3 = user/guest
> 
> Yes, that would fit the first variant I've described. I take it you
> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
> is there a particular reason?

Yes. As written in the comment in flush_area_local() I can't be sure
whether the current address space is that of a domain with XPTI
enabled (the idle domain could be "current"). So I need to always
flush with PCID 0 and with the possible PCID values for a XPTI domain.
When using PCID 0 for XPTI as well I'll need 4 

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-26 Thread Juergen Gross
On 23/03/18 16:58, Jan Beulich wrote:
 On 23.03.18 at 15:11,  wrote:
>> On 23/03/18 14:46, Jan Beulich wrote:
>>> Valid point. Looking at all present uses of ->arch.cr3, it's probably
>>> indeed better the way you have it. However, I'm now wondering
>>> about something else: make_cr3() leaves PCID as zero for HVM
>>> and idle domains, but runs Xen with PCIDs 2 and 3 for (some) PV
>>> domains. That looks like an undesirable setup though - it would
>>> seem better to run Xen (with full page tables) with PCID 0 at all
>>> times.
>>>
>>> Then we'd have e.g.
>>> PCID 0  Xen (full page tables)
>>> PCID x  PV guest priv
>>> PCID y  PV guest user
>>
>> So this would need another way to switch between guest and xen %cr3.
>> Or would you want to use different %cr3 values with the same PCID
>> without flushing the TLB in between? This seems to be a way to ask for
>> problems...
> 
> Well, a TLB flush is clearly needed in such a setup when going
> from kernel to user mode.
> 
>> In case you'd use the same %cr3 (guest kernel one, I guess) for both
>> cases: are you really sure there is no problem in any hypervisor path
>> accessing guest data which would result in using guest kernel access
>> rights when coming from user mode (BTW: that was the security note I
>> had in v2 of my series).
> 
> I'm afraid I don't understand: Same %cr3? There are separate
> kernel and user page tables, requiring different values anyway.
> I also don't understand what problems in hypervisor code paths
> you suspect, when everything looks to work fine right now
> without PCID.

With switching between different page tables you need to flush the TLB.
That was my point.

> 
>>> Global pages in PCID 0 could then still be permitted, and wouldn't
>>> ever need flushing except when FLUSH_TLB_GLOBAL is requested.
>>>
>>> As to the use of two separate PCIDs for PV kernel and user modes
>>> - while this helps isolation, it prevents recovering the non-XPTI
>>> property of user mode TLB entries surviving in-guest mode switches.
>>
>> I don't get that. With PCID the guest's kernel _and_ user entries
>> will survive in-guest mode switches as there is no TLB flushing
>> involved (the no-flush bit is set in v->arch.cr3 for both modes).
>>
>> The only downside are guest kernel accesses to user pages: they will
>> need additional TLB entries as the PCID is different.
> 
> That's the point I was trying to make. This was further explained
> in my previous reply a little further down.
> 
>>> I wonder whether this is part of the reason you see PCID have a
>>> negative effect in the non-XPTI case.
>>>
>>> So in the end the question is: Why not use just two PCIDs, and
>>> allow global pages just like we do now, with the added benefit
>>> that we no longer need to flush Xen's global TLB entries just
>>> because we want to get rid of PV guest user ones.
>>
>> I can't see how that would work without either needing some more TLB
>> flushes in order to prevent stale TLB entries or loosing the Meltdown
>> mitigation.
>>
>> Which %cr3/PCID combination should be used in hypervisor, guest kernel
>> and guest user mode?
> 
> Xen would run with PCID 0 (and full Xen mappings) at all times
> (except early entry and late exit code of course). The guest would
> run with PCID 1 (and minimal Xen mappings) at all times. The switch
> of PCID eliminates the need for flushes on the way out and back in.

You still need the kernel page tables flushed when switching to user
mode, right?

> 
>> Which pages would be global?
> 
> Use of global pages would continue to be as today: Xen has some,
> and guest user mode has some. Of course it is quite possible that
> the use of global pages with a single guest PCID is still worse than
> no global pages with two guest PCIDs, but that's a separate step
> to take (and measure) imo.

But global pages of Xen would either make it vulnerable with regard to
Meltdown or you need a TLB flush again when switching between Xen and
guest making all the PCID stuff moot.

> 
 I don't
 want to use global guest user pages together with PCID as flushing
 global pages from the TLB with PCID enabled requires flushing either
 the complete TLB or you'd have to use INVLPG in all possible address
 spaces (so you'd need to have multiple %cr3 switches).
>>>
>>> Well, yes, flushing _individual_ pages is a problem in that case.
>>> As to multiple CR3 switches - are these all that bad really with
>>> the no-flush bit set? With the reduced number of PCIDs in actual
>>> use (as discussed above) "all possible address spaces" would
>>> mean just two. And I could imagine that in a number of cases
>>> just one INVLPG (with the right PCID active) might suffice.
>>>
>>> One complicating factor is that we don't want to introduce
>>> Xen TLB entries for other than what we map in the minimal page
>>> tables into PV guest PCID space, which would happen if we
>>> simply switched PCID around an INVLPG.
>>>
>>> What I don't understand 

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-26 Thread Juergen Gross
On 26/03/18 10:28, Jan Beulich wrote:
 On 26.03.18 at 08:49,  wrote:
>> On 23/03/18 16:58, Jan Beulich wrote:
>> On 23.03.18 at 15:11,  wrote:
 On 23/03/18 14:46, Jan Beulich wrote:
> So in the end the question is: Why not use just two PCIDs, and
> allow global pages just like we do now, with the added benefit
> that we no longer need to flush Xen's global TLB entries just
> because we want to get rid of PV guest user ones.

 I can't see how that would work without either needing some more TLB
 flushes in order to prevent stale TLB entries or loosing the Meltdown
 mitigation.

 Which %cr3/PCID combination should be used in hypervisor, guest kernel
 and guest user mode?
>>>
>>> Xen would run with PCID 0 (and full Xen mappings) at all times
>>> (except early entry and late exit code of course). The guest would
>>> run with PCID 1 (and minimal Xen mappings) at all times. The switch
>>> of PCID eliminates the need for flushes on the way out and back in.
>>
>> You still need the kernel page tables flushed when switching to user
>> mode, right?
> 
> Of course.
> 
 Which pages would be global?
>>>
>>> Use of global pages would continue to be as today: Xen has some,
>>> and guest user mode has some. Of course it is quite possible that
>>> the use of global pages with a single guest PCID is still worse than
>>> no global pages with two guest PCIDs, but that's a separate step
>>> to take (and measure) imo.
>>
>> But global pages of Xen would either make it vulnerable with regard to
>> Meltdown or you need a TLB flush again when switching between Xen and
>> guest making all the PCID stuff moot.
> 
> No - the guest would run with PCID 1 active, and global Xen TLB
> entries would exist for PCID 0 only.

Uuh, global pages are accessible via all PCIDs. That's why they are
called global...

> 
>> So lets compare the possibilities:
>>
>> My approach:
>> - no global pages
>> - 4 different PCIDs
>> - no TLB flushes needed when switching between Xen and guest
>> - no TLB flushes needed when switching between guest user and kernel
>> - flushing of single pages (guest or Xen) rather simple (4 INVPCIDs)
>> - flushing of complete TLB via 1 INVPCID
>>
>> 2 PCIDs (Xen and guest), keeping guest user pages as global pages
>> - Xen can't use global pages - global bit must be handled dynamically
>>   for Xen pages (or do we want to drop global pages e.g. for AMD, too?
> 
> As per above - I don't see why Xen couldn't use global pages.
> The option of using them is part of why I'm wondering whether
> this might be worth looking into.

See chapter 4.10.2.4 SDM Vol. 3

> 
>> - 2 PCIDs
>> - no TLB flushes needed when switching between Xen and guest
>> - when switching from guest kernel to guest user the kernel pages must
>>   be flushed from TLB
>> - flushing of single guest user pages needs 2 changes of %cr3 and 2
>>   INVLPGs, switch code must be mapped to guest page tables
>> - flushing of complete TLB via 1 INVPCID
>>
>> So the advantage of the 2 PCID solution are the single TLB entries for
>> guest user pages compared to 2 entries for guest user pages accessed by
>> the guest kernel or Xen.
>>
>> The disadvantage are the flushed guest kernel pages when executing user
>> code, the more complicated single user page flushing and the dynamical
>> Xen global bit handling.
> 
> Right. In order to make forward progress here I think we should
> shelve the discussion on the 2-PCID alternative for now. What I'd
> like to ask for as a change to your current approach is to use
> PCID 0 for Xen rather than running Xen with PCIDs 2 or 3 when
> PCIDs are enabled, and (implicitly) with PCID 0 when they're
> disabled. Or alternatively don't use PCID 0 at all when PCIDs are
> enabled. I'm simply worried of us overlooking a case where PCID
> 0 TLB entries may be left in place (when switching between PCIDs
> enabled and PCIDs disabled) when they should have been flushed,
> opening back up a Meltdown-like attack window.
The reason I didn't use PCID 0 for running Xen was to use a few
INVPCID calls as possible for single page invalidation and still
covering the cases for PCID on while XPTI off and including PCID 0.

I can change the scheme to use different values for guest PCIDs
with XPTI on, of course. Are you fine with:

- XPTI off: PCID 0 = kernel, PCID 1 = user
- XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
PCID 2 = kernel/guest, PCID 3 = user/guest

Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] Cut-off date for Xen 4.11 is March 30th, 2018

2018-03-29 Thread Juergen Gross
Hi all,

The cut-off date for Xen 4.11 is March 30th, 2018. If you want your
features to be included for the release, please make sure they are
committed by March 30th, 2018.

Juergen Gross

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] Freeze date for 4.11 shifted by one week

2018-03-29 Thread Juergen Gross
Hi all,

as the original freeze date (March 30th, 2018) is a holiday in many
countries and some of the maintainers have been very busy with
security work during most of the development phase of Xen 4.11 I've
decided to shift the freeze date of Xen 4.11 by one week.

So the new freeze date will be April 6th 2018.

Patches not being committed until then will miss Xen 4.11.

This is a one-time decision not automatically applying to future
releases.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Cut-off date for Xen 4.11 is March 30th, 2018

2018-03-29 Thread Juergen Gross
On 29/03/18 11:53, George Dunlap wrote:
> On 03/29/2018 07:52 AM, Juergen Gross wrote:
>> Hi all,
>>
>> The cut-off date for Xen 4.11 is March 30th, 2018. If you want your
>> features to be included for the release, please make sure they are
>> committed by March 30th, 2018.
> 
> March 30th is a public holiday here in the UK.  Is it the same in
> Germany?  Would it be OK to say that things sent on Friday can be
> committed on Tuesday 3 April if the appropriate maintainer wasn't around
> to review them?
> 
> If not we should warn people to get their stuff reviewed today if at all
> possible.
> 
> As it happens I'll be working Friday so I can check in stuff that's got
> the right Acks / R-b's; but I won't do last-pass reviews on behalf of
> maintainers.

I already thought of shifting the freeze by one week. Main reason is
that several maintainers seem to have a backlog of patches to review
which IMO should make it into 4.11.

Thoughts?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Cut-off date for Xen 4.11 is March 30th, 2018

2018-03-29 Thread Juergen Gross
On 29/03/18 12:50, Wei Liu wrote:
> On Thu, Mar 29, 2018 at 04:35:27AM -0600, Jan Beulich wrote:
>>>>> On 29.03.18 at 12:22, <george.dun...@citrix.com> wrote:
>>> On 03/29/2018 10:56 AM, Juergen Gross wrote:
>>>> On 29/03/18 11:53, George Dunlap wrote:
>>>>> On 03/29/2018 07:52 AM, Juergen Gross wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> The cut-off date for Xen 4.11 is March 30th, 2018. If you want your
>>>>>> features to be included for the release, please make sure they are
>>>>>> committed by March 30th, 2018.
>>>>>
>>>>> March 30th is a public holiday here in the UK.  Is it the same in
>>>>> Germany?  Would it be OK to say that things sent on Friday can be
>>>>> committed on Tuesday 3 April if the appropriate maintainer wasn't around
>>>>> to review them?
>>>>>
>>>>> If not we should warn people to get their stuff reviewed today if at all
>>>>> possible.
>>>>>
>>>>> As it happens I'll be working Friday so I can check in stuff that's got
>>>>> the right Acks / R-b's; but I won't do last-pass reviews on behalf of
>>>>> maintainers.
>>>>
>>>> I already thought of shifting the freeze by one week. Main reason is
>>>> that several maintainers seem to have a backlog of patches to review
>>>> which IMO should make it into 4.11.
>>>>
>>>> Thoughts?
>>>
>>> Well there's a benefit to this, but also a risk: that people will begin
>>> to see the "hard freeze" as more like a "soft freeze", that will always
>>> be pushed back / flexed if you push hard enough.  Part of the purpose of
>>> setting the hard freeze months in advance is so that people can plan
>>> ahead and get stuff reviewed in time; part of the reason for having
>>> 6-month releases is so that the cost of delaying a feature / patchset to
>>> the next release isn't very high.
>>
>> As mentioned before I think anyway that we should revisit this
>> hard freeze date approach. I would much favor a hard freeze
>> date on where it is determined which features are intended to
>> make it and which not. Right now at least everything related to
>> Spectre and Meltdown would imo want to go into the category
>> of "we'll wait until it's in".
>>
> 
> You're mixing up two things: features and security fixes (and their
> subsequent patches). I agree the latter should get special attention
> because missing those would essentially render a release useless or
> unattractive.  Meltdown and Spectre fall into the second category, as
> with all the XSAs.

And we still have the possibility of individual Release-Acks.

> But most of the time, and most developers / contributors write new
> features.  If they are identified with strategic importance, we should
> wait (livepatching comes to mind), but for normal ones (which noone
> argues for), we should have the default position to not wait.
> 
> This isn't incompatible with what you said.

Right.

Still I think shifting by one week, given the current situation where
some maintainers had to spend a significant amount of the development
phase with security stuff instead of being able to review patches, is
a sensible thing to do.

So I think I'll do that with making it very clear that this won't be the
default process for the following releases.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-26 Thread Juergen Gross
On 26/03/18 12:43, Jan Beulich wrote:
 On 26.03.18 at 12:29,  wrote:
>> On 26/03/18 12:13, Jan Beulich wrote:
>> On 26.03.18 at 10:55,  wrote:
 I can change the scheme to use different values for guest PCIDs
 with XPTI on, of course. Are you fine with:

 - XPTI off: PCID 0 = kernel, PCID 1 = user
 - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
 PCID 2 = kernel/guest, PCID 3 = user/guest
>>>
>>> Yes, that would fit the first variant I've described. I take it you
>>> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
>>> is there a particular reason?
>>
>> Yes. As written in the comment in flush_area_local() I can't be sure
>> whether the current address space is that of a domain with XPTI
>> enabled (the idle domain could be "current"). So I need to always
>> flush with PCID 0 and with the possible PCID values for a XPTI domain.
>> When using PCID 0 for XPTI as well I'll need 4 INVPCIDs, while when
>> avoiding it I'd need 5 (at least when current == idle).
> 
> I see. Which makes me wonder whether a suitable combination
> of INVLPG (to get rid of global entries) and INVPCID couldn't be
> used instead. For example, you may be able to replace the
> INVPCID for the active PCID by INVLPG (without needing to
> know who "current" is).

INVLPG has the disadvantage to clear all paging-structure cache
entries associated with the current PCID.

And I thought we were finally on the same page not to use global
pages with PCID enabled?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-26 Thread Juergen Gross
On 26/03/18 14:19, Jan Beulich wrote:
 On 26.03.18 at 14:04,  wrote:
>> On 26/03/18 12:43, Jan Beulich wrote:
>> On 26.03.18 at 12:29,  wrote:
 On 26/03/18 12:13, Jan Beulich wrote:
 On 26.03.18 at 10:55,  wrote:
>> I can change the scheme to use different values for guest PCIDs
>> with XPTI on, of course. Are you fine with:
>>
>> - XPTI off: PCID 0 = kernel, PCID 1 = user
>> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>> PCID 2 = kernel/guest, PCID 3 = user/guest
>
> Yes, that would fit the first variant I've described. I take it you
> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
> is there a particular reason?

 Yes. As written in the comment in flush_area_local() I can't be sure
 whether the current address space is that of a domain with XPTI
 enabled (the idle domain could be "current"). So I need to always
 flush with PCID 0 and with the possible PCID values for a XPTI domain.
 When using PCID 0 for XPTI as well I'll need 4 INVPCIDs, while when
 avoiding it I'd need 5 (at least when current == idle).
>>>
>>> I see. Which makes me wonder whether a suitable combination
>>> of INVLPG (to get rid of global entries) and INVPCID couldn't be
>>> used instead. For example, you may be able to replace the
>>> INVPCID for the active PCID by INVLPG (without needing to
>>> know who "current" is).
>>
>> INVLPG has the disadvantage to clear all paging-structure cache
>> entries associated with the current PCID.
>>
>> And I thought we were finally on the same page not to use global
>> pages with PCID enabled?
> 
> Yes, we are. If no global TLB entries can ever survive a CR4.PCIDE
> 0 -> 1 transition, all would be fine without INVLPG of course.

Okay, I'll add an ASSERT() to make sure this is true:

ASSERT(!cr4.pge || !cr4.pcide)


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 7/7] xen/x86: use PCID feature

2018-04-03 Thread Juergen Gross
On 29/03/18 17:37, Jan Beulich wrote:
 On 29.03.18 at 17:15,  wrote:
>> On 29/03/18 16:19, Jan Beulich wrote:
>> On 27.03.18 at 11:07,  wrote:
 @@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long 
 cr4)
  t = pre_flush();
  
  if ( read_cr4() & X86_CR4_PGE )
 +/*
 + * X86_CR4_PGE set means PCID being inactive.
 + * We have to purge the TLB via flipping cr4.pge.
 + */
  write_cr4(cr4 & ~X86_CR4_PGE);
 +else if ( cpu_has_invpcid )
 +/*
 + * If we are using PCID purge the TLB via INVPCID as loading cr3
 + * will affect the current PCID only.
>>>
>>> s/current/new/ ?
>>
>> Okay.
>>
>>>
 + * If INVPCID is not supported we don't use PCIDs so loading cr3
 + * will purge the TLB (we are in the "global pages off" branch).
 + * invpcid_flush_all_nonglobals() seems to be faster than
 + * invpcid_flush_all().
 + */
 +invpcid_flush_all_nonglobals();
  
  asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>>>
>>> What about the TLB entries that have been re-created between
>>> the INVPCID and the write of CR3? It's not obvious to me that
>>> leaving them around is not going to be a problem subsequently,
>>> the more that you write cr3 frequently with the no-flush bit set.
>>
>> The no-flush bit should not make any difference here. It controls only
>> flushing of TLB-entries with the new PCID.
> 
> Right, but in a subsequent write to CR3 you may make active again
> what was the old PCID here. This is in particular so for PCID 0 (which
> has dual use).
> 
>>> Don't you need to do a single context invalidation for the prior
>>> PCID (if different from the new one)?
>>
>> Hmm, I don't think so, as the mov to %cr3 is a serializing instruction
>> acting as a speculation barrier. So the only TLB entries which could
>> survive would be the ones for the few instruction bytes after the
>> invpcid instruction until the end of the mov to %cr3. Those are harmless
>> as they are associated with the hypervisor PCID value, so there is no
>> risk of any data leak to a guest. Maybe a comment explaining that would
>> be a good idea.
> 
> Well, to be honest I don't trust in things like "speculation barrier"
> anymore, at least not as far as things ahead of the barrier go.
> Who knows what exactly the CPU does (and hence which TLB
> entries it creates) between the INVPCID and the CR3 write. I
> don't think we can blindly assume only entries for Xen mappings
> could be created during that window.

Those speculation barriers are one of the main mitigations for Spectre.
So either we don't think Spectre can be mitigated by using those or we
should trust the barriers to be effective in this case, too, IMHO.

Which documented behavior of the processor are you going to trust? I
think as long as there are no known errata in this regard we should
assume the cpu will behave as documented. And mv to %cr3 is documented
to be serializing nad serializing instruction are documented to be
speculation barriers.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 1/8] x86: NOP out XPTI entry/exit code when it's not in use

2018-04-03 Thread Juergen Gross
On 19/03/18 14:37, Jan Beulich wrote:
> Introduce a synthetic feature flag to use alternative instruction
> patching to NOP out all code on entry/exit paths. Having NOPs here is
> generally better than using conditional branches.
> 
> Also change the limit on the number of bytes we can patch in one go to
> that resulting from the encoding in struct alt_instr - there's no point
> reducing it below that limit, and without a check being in place that
> the limit isn't actually exceeded, such an artificial boundary is a
> latent risk.
> 
> Signed-off-by: Jan Beulich 

Just did a parallel make of the hypervisor with and without the patch,
with xpti=true and with xpti=false (values in braces are stddev).

   elapsed system  user
unpatched, xpti=false: 118.69 ( 1.40)  168.38 (12.64)  186.11 ( 5.49)
unpatched, xpti=true : 128.02 ( 5.97)  219.66 (23.06)  197.84 ( 4.53)
patched,   xpti=false:  90.65 ( 6.63)   99.50 (14.79)  180.35 ( 5.97)
patched,   xpti=true : 111.69 ( 9.93)  163.63 (13.05)  181.22 ( 3.71)


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 1/8] x86: NOP out XPTI entry/exit code when it's not in use

2018-04-04 Thread Juergen Gross
On 03/04/18 19:48, Juergen Gross wrote:
> On 19/03/18 14:37, Jan Beulich wrote:
>> Introduce a synthetic feature flag to use alternative instruction
>> patching to NOP out all code on entry/exit paths. Having NOPs here is
>> generally better than using conditional branches.
>>
>> Also change the limit on the number of bytes we can patch in one go to
>> that resulting from the encoding in struct alt_instr - there's no point
>> reducing it below that limit, and without a check being in place that
>> the limit isn't actually exceeded, such an artificial boundary is a
>> latent risk.
>>
>> Signed-off-by: Jan Beulich <jbeul...@suse.com>
> 
> Just did a parallel make of the hypervisor with and without the patch,
> with xpti=true and with xpti=false (values in braces are stddev).

The unpatched version was configured differently than the patched one.
So the real numbers are:

   elapsed system  user
unpatched, xpti=false:  89.96 ( 8.07)   97.05 ( 5.69)  178.64 ( 2.39)
unpatched, xpti=true : 113.42 ( 9.80)  165.99 (15.10)  180.99 ( 2.66)
patched,   xpti=false:  90.65 ( 6.63)   99.50 (14.79)  180.35 ( 5.97)
patched,   xpti=true : 111.69 ( 9.93)  163.63 (13.05)  181.22 ( 3.71)

So the XPTI case is a little bit faster with the patch, while the
non-XPTI case is a little bit slower.

Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [linux-linus bisection] complete test-amd64-amd64-xl-pvhv2-amd

2018-04-04 Thread Juergen Gross
On 03/04/18 18:55, Sander Eikelenboom wrote:
> On 03/04/18 12:29, Juergen Gross wrote:
>> On 03/04/18 12:19, osstest service owner wrote:
>>> branch xen-unstable
>>> xenbranch xen-unstable
>>> job test-amd64-amd64-xl-pvhv2-amd
>>> testid guest-start
>>>
>>> Tree: linux 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>> Tree: linuxfirmware git://xenbits.xen.org/osstest/linux-firmware.git
>>> Tree: qemu git://xenbits.xen.org/qemu-xen-traditional.git
>>> Tree: qemuu git://xenbits.xen.org/qemu-xen.git
>>> Tree: xen git://xenbits.xen.org/xen.git
>>>
>>> *** Found and reproduced problem changeset ***
>>>
>>>   Bug is in tree:  xen git://xenbits.xen.org/xen.git
>>>   Bug introduced:  4a5733771e6f33918eba07b584e564a67ac1
>>>   Bug not present: 1c2e0f9e4f263714db917eb54f8d1c2d1463ed4c
>>>   Last fail repro: http://logs.test-lab.xenproject.org/osstest/logs/118498/
>>>
>>>
>>>   commit 4a5733771e6f33918eba07b584e564a67ac1
>>>   Author: Juergen Gross <jgr...@suse.com>
>>>   Date:   Fri Dec 1 15:14:07 2017 +0100
>>>   
>>>   libxl: put RSDP for PVH guest near 4GB
>>>   
>>>   Instead of locating the RSDP table below 1MB put it just below 4GB
>>>   like the rest of the ACPI tables in case of PVH guests. This will
>>>   avoid punching more holes than necessary into the memory map.
>>>   
>>>   Signed-off-by: Juergen Gross <jgr...@suse.com>
>>>   Acked-by: Wei Liu <wei.l...@citrix.com>
>>>   Reviewed-by: Roger Pau Monné <roger@citrix.com>
>>
>> The corresponding Linux kernel patch just made it upstream.
>>
>>
>> Juergen
> 
> Hi Juergen,
> 
> Are those kernel patches heading for linux-stable as well ?

That was the plan. They'll probably need some adjustments.

> 
> I ask this, because it would be nice to be able to use PVH on Xen 4.11 
> release with a distro kernel
> (4.9 or 4.14 stable for instance for Debian).
> PVH worked fine with xen-4.11-to-be up until this commit, so the kernel 
> patches fix this (non-kernel) regression.

This _is_ a kernel regression as the kernel didn't comply to the
PVH ABI. It relied on a not guaranteed behavior of Xen.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 7/7] xen/x86: use PCID feature

2018-03-29 Thread Juergen Gross
On 29/03/18 16:19, Jan Beulich wrote:
 On 27.03.18 at 11:07,  wrote:
>> --- a/xen/arch/x86/domain_page.c
>> +++ b/xen/arch/x86/domain_page.c
>> @@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
>>  if ( (v = idle_vcpu[smp_processor_id()]) == current )
>>  sync_local_execstate();
>>  /* We must now be running on the idle page table. */
>> -ASSERT(read_cr3() == __pa(idle_pg_table));
>> +ASSERT((read_cr3() & ~X86_CR3_PCID_MASK) == __pa(idle_pg_table));
> 
> This would better use X86_CR3_ADDR_MASK as well.

Right.

> 
>> @@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>  t = pre_flush();
>>  
>>  if ( read_cr4() & X86_CR4_PGE )
>> +/*
>> + * X86_CR4_PGE set means PCID being inactive.
>> + * We have to purge the TLB via flipping cr4.pge.
>> + */
>>  write_cr4(cr4 & ~X86_CR4_PGE);
>> +else if ( cpu_has_invpcid )
>> +/*
>> + * If we are using PCID purge the TLB via INVPCID as loading cr3
>> + * will affect the current PCID only.
> 
> s/current/new/ ?

Okay.

> 
>> + * If INVPCID is not supported we don't use PCIDs so loading cr3
>> + * will purge the TLB (we are in the "global pages off" branch).
>> + * invpcid_flush_all_nonglobals() seems to be faster than
>> + * invpcid_flush_all().
>> + */
>> +invpcid_flush_all_nonglobals();
>>  
>>  asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
> 
> What about the TLB entries that have been re-created between
> the INVPCID and the write of CR3? It's not obvious to me that
> leaving them around is not going to be a problem subsequently,
> the more that you write cr3 frequently with the no-flush bit set.

The no-flush bit should not make any difference here. It controls only
flushing of TLB-entries with the new PCID.

> Don't you need to do a single context invalidation for the prior
> PCID (if different from the new one)?

Hmm, I don't think so, as the mov to %cr3 is a serializing instruction
acting as a speculation barrier. So the only TLB entries which could
survive would be the ones for the few instruction bytes after the
invpcid instruction until the end of the mov to %cr3. Those are harmless
as they are associated with the hypervisor PCID value, so there is no
risk of any data leak to a guest. Maybe a comment explaining that would
be a good idea.

> 
>> @@ -499,7 +500,26 @@ void free_shared_domheap_page(struct page_info *page)
>>  
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>> +struct domain *d = v->domain;
>> +
>>  v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> +if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
>> +v->arch.cr3 |= get_pcid_bits(v, false);
>> +}
>> +
>> +unsigned long pv_guest_cr4_to_real_cr4(struct vcpu *v)
> 
> const

Okay (for the other cases, too).

> 
>> +{
>> +struct domain *d = v->domain;
> 
> again
> 
>> @@ -298,11 +362,21 @@ int pv_domain_initialise(struct domain *d)
>>  
>>  static void _toggle_guest_pt(struct vcpu *v, bool force_cr3)
>>  {
>> +struct domain *d = v->domain;
> 
> and one more
> 
>> --- a/xen/include/asm-x86/x86-defns.h
>> +++ b/xen/include/asm-x86/x86-defns.h
>> @@ -45,7 +45,9 @@
>>  /*
>>   * Intel CPU flags in CR3
>>   */
>> -#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
>> +#define X86_CR3_NOFLUSH(_AC(1, ULL) << 63)
>> +#define X86_CR3_ADDR_MASK  (PAGE_MASK & ~X86_CR3_NOFLUSH)
> 
> This would better be PAGE_MASK & PADDR_MASK: bits 52...62
> are reserved (the respective figure in chapter 2 of the SDM is not to
> be trusted, the tables in the "4-level paging" section are more likely to
> be correct).

Okay.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB

2018-03-29 Thread Juergen Gross
On 29/03/18 15:44, Jan Beulich wrote:
 On 27.03.18 at 11:07,  wrote:
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -63,6 +63,10 @@ boolean_param("nosmp", opt_nosmp);
>>  static unsigned int __initdata max_cpus;
>>  integer_param("maxcpus", max_cpus);
>>  
>> +/* opt_invpcid: If false, don't use INVPCID instruction even if available. 
>> */
>> +static bool __initdata opt_invpcid = true;
>> +boolean_param("invpcid", opt_invpcid);
> 
> Hmm, I'm sorry for noticing only now (while seeing the questionable
> uses of cpu_has_invpcid in patch 7), but this being an init-only
> variable and having ...
> 
>> @@ -1549,6 +1553,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>  if ( cpu_has_fsgsbase )
>>  set_in_cr4(X86_CR4_FSGSBASE);
>>  
>> +if ( !opt_invpcid )
>> +setup_clear_cpu_cap(X86_FEATURE_INVPCID);
> 
> ... this effect has two issues: For one, in such a case this should
> be a sub-option to "cpuid=". And then afaict it also disables use of
> INVPCID in HVM guests. IOW I think you want to retain the option
> but make the variable non-init and non-static. Obviously for early
> boot use it may then no longer be possible to set it to true at build
> time (you may end up needing two variables).

Okay. I'll change it.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] broken build: 448c03b3cbe14873ee637755a29ea26ee7ca9ef9

2018-03-20 Thread Juergen Gross
On 19/03/18 20:04, Doug Goldstein wrote:
> commit 448c03b3cbe14873ee637755a29ea26ee7ca9ef9
> Author: Juergen Gross <jgr...@suse.com>
> Date:   Mon Feb 26 09:46:12 2018 +0100
> 
> This commit breaks the build of qemu-xen-traditional for:
> 
> Ubuntu 14.04: https://gitlab.com/cardoe/xen/-/jobs/58266170
> Ubuntu 16.04: https://gitlab.com/cardoe/xen/-/jobs/58266174
> 
> A short snippet of the failure is:
> 
>   ARi386-dm/libqemu.a
>   LINK  i386-dm/qemu-dm
> /builds/cardoe/xen/tools/../tools/xenstore/libxenstore.so: undefined
> reference to `dlsym'
> collect2: error: ld returned 1 exit status
> make[5]: *** [qemu-dm] Error 1
> make[5]: Leaving directory
> `/builds/cardoe/xen/tools/qemu-xen-traditional-dir-remote/i386-dm'
> make[4]: *** [subdir-i386-dm] Error 2
> make[4]: Leaving directory
> `/builds/cardoe/xen/tools/qemu-xen-traditional-dir-remote'
> make[3]: *** [subdir-all-qemu-xen-traditional-dir] Error 2
> make[3]: Leaving directory `/builds/cardoe/xen/tools'
> make[2]: *** [subdirs-install] Error 2
> make[2]: Leaving directory `/builds/cardoe/xen/tools'
> make[1]: *** [install] Error 2
> make[1]: Leaving directory `/builds/cardoe/xen/tools'
> make: *** [install-tools] Error 2
> 

Did you have commit c9bd8a73656d7435b1055ee8825823aee995993e ?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] x86/xen: Delay get_cpu_cap until stack canary is established

2018-03-20 Thread Juergen Gross
On 19/03/18 23:22, Boris Ostrovsky wrote:
> On 03/19/2018 12:58 PM, Jason Andryuk wrote:
>> Commit 2cc42bac1c79 ("x86-64/Xen: eliminate W+X mappings") introduced a
>> call to get_cpu_cap, which is fstack-protected.  This is works on x86-64
> 
> s/This is works/This works/
> 
> Reviewed-by: Boris Ostrovsky 
> 
> Do we still need 4f277295e54?

I'd rather keep it in order to avoid nasty problems in case something
changes. After all we are trying to do an initialization in C code
which should be done in assembly before entering the C part. Doing this
properly for 32-bit pv-kernels would be rather difficult, but this is no
reason to drop the correct solution for the 64-bit case.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] TLB flushing

2018-03-20 Thread Juergen Gross
While hunting a strange bug in my PCID patch series hinting at some
TLB invalidation problem I discovered a piece of code looking rather
fishy to me.

Is it correct for new_tlbflush_clock_period() to use FLUSH_TLB instead
of FLUSH_TLB_GLOBAL?

While not being a problem in current code as both will flush all TLB
entries my series will change that by using invpcid to flush only the
non-global entries if FLUSH_TLB_GLOBAL wasn't set.

I can send a patch if anyone can confirm that using FLUSH_TLB only is
wrong.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] TLB flushing

2018-03-20 Thread Juergen Gross
On 20/03/18 10:19, Jan Beulich wrote:
 On 20.03.18 at 09:50,  wrote:
>> While hunting a strange bug in my PCID patch series hinting at some
>> TLB invalidation problem I discovered a piece of code looking rather
>> fishy to me.
>>
>> Is it correct for new_tlbflush_clock_period() to use FLUSH_TLB instead
>> of FLUSH_TLB_GLOBAL?
>>
>> While not being a problem in current code as both will flush all TLB
>> entries my series will change that by using invpcid to flush only the
>> non-global entries if FLUSH_TLB_GLOBAL wasn't set.
>>
>> I can send a patch if anyone can confirm that using FLUSH_TLB only is
>> wrong.
> 
> I think this shouldn't be a separate patch, but an integral part of the
> one introducing the distinction between "all" and non-global flushes.
> This is because
> - right now it doesn't make a difference (we do "all" flushes anyway),
> - back in the 32-bit days it didn't matter because guest mappings
>   would never have been allowed to be global, and transient Xen
>   mappings also would never have had the G bit set.
> IOW with what used to be named USER_MAPPINGS_ARE_GLOBAL
> this would need to become FLUSH_TLB_GLOBAL at the point the
> kind of flush gets altered, while without it could remain at FLUSH_TLB.

Really? Aren't global hypervisor mappings affected by this, too?

> Perhaps it is worthwhile to retain this distinction just for
> documentation purposes (in case a future change wants to turn off
> that USER_MAPPINGS_ARE_GLOBAL behavior for whatever reason).

I think as long as FLUSH_TLB_GLOBAL is being used in the code
new_tlbflush_clock_period() should do so, too. In case there is no need
to do TLB flushes including global pages, FLUSH_TLB_GLOBAL can be
modified to do only non-global flushing (with a comment explaining why
this is secure).


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] TLB flushing

2018-03-20 Thread Juergen Gross
On 20/03/18 10:56, Jan Beulich wrote:
 On 20.03.18 at 10:29,  wrote:
>> On 20/03/18 10:19, Jan Beulich wrote:
>> On 20.03.18 at 09:50,  wrote:
 While hunting a strange bug in my PCID patch series hinting at some
 TLB invalidation problem I discovered a piece of code looking rather
 fishy to me.

 Is it correct for new_tlbflush_clock_period() to use FLUSH_TLB instead
 of FLUSH_TLB_GLOBAL?

 While not being a problem in current code as both will flush all TLB
 entries my series will change that by using invpcid to flush only the
 non-global entries if FLUSH_TLB_GLOBAL wasn't set.

 I can send a patch if anyone can confirm that using FLUSH_TLB only is
 wrong.
>>>
>>> I think this shouldn't be a separate patch, but an integral part of the
>>> one introducing the distinction between "all" and non-global flushes.
>>> This is because
>>> - right now it doesn't make a difference (we do "all" flushes anyway),
>>> - back in the 32-bit days it didn't matter because guest mappings
>>>   would never have been allowed to be global, and transient Xen
>>>   mappings also would never have had the G bit set.
>>> IOW with what used to be named USER_MAPPINGS_ARE_GLOBAL
>>> this would need to become FLUSH_TLB_GLOBAL at the point the
>>> kind of flush gets altered, while without it could remain at FLUSH_TLB.
>>
>> Really? Aren't global hypervisor mappings affected by this, too?
> 
> Yes and no. The timestamp here is needed to know whether to
> flush when a page gets recycled. As long as G is never set on
> guest controlled mappings for pages which may be recycled,
> there's no issue. In particular, the G bits in the 1:1 mappings are
> of no interest here (and in fact anything Xen maintains which no
> guest can control), as those mappings never go away (or if they
> did, e.g. when a page needs to be offlined for causing #MC, an
> explicit global flush for that page would be required).

I just verified that there is at least one problem in the hypervisor
TLB flushing logic: using invpcid it is possible to flush the non-global
entries only. If I do that in case of FLUSH_TLB_GLOBAL not being set I
get segfaults in user mode of the guest.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 3/7] xen/x86: support per-domain flag for xpti

2018-03-22 Thread Juergen Gross
On 22/03/18 16:26, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> +void xpti_domain_init(struct domain *d)
>> +{
>> +if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
>> +return;
> 
> As you rely on the zero-initialization of the field here, ...
> 
>> +switch ( opt_xpti )
>> +{
>> +case XPTI_OFF:
>> +d->arch.pv_domain.xpti = false;
> 
> ... this could go away as well.

I wanted to make the switch statement complete. No problem to drop
setting of xpti here of you like that better.

> 
>> @@ -1050,8 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>>  panic("Error %d setting up PV root page table\n", rc);
>>  if ( per_cpu(root_pgt, 0) )
>>  {
>> -get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
>> -
>> +get_cpu_info()->pv_cr3 = 0;
>>  /*
>>   * All entry points which may need to switch page tables have to 
>> start
>>   * with interrupts off. Re-write what pv_trap_init() has put there.
> 
> Please don't drop the blank line.

Okay.

> 
>> @@ -36,7 +38,8 @@ static inline void pv_vcpu_destroy(struct vcpu *v) {}
>>  static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
>>  static inline void pv_domain_destroy(struct domain *d) {}
>>  static inline int pv_domain_initialise(struct domain *d) { return 
>> -EOPNOTSUPP; }
>> -
>> +static inline void xpti_init(void) {}
>> +static inline void xpti_domain_init(struct domain *d) {}
>>  #endif  /* CONFIG_PV */
> 
> Same here. With that
> Reviewed-by: Jan Beulich 

Thanks,


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible

2018-03-22 Thread Juergen Gross
On 22/03/18 15:31, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned 
>> int flags)
>>  }
>>  }
>>  
>> +if ( flags & FLUSH_ROOT_PGTBL )
>> +get_cpu_info()->root_pgt_changed = true;
>> +
>>  local_irq_restore(irqfl);
>>  
>>  return flags;
> 
> Does this really need to sit inside the interrupts disabled section?

Hmm, no, I don't think so. I'll move it below local_irq_restore().

> Thinking about it I even wonder whether the cache flush part needs
> to be. Even for the INVLPG portion of the TLB flush part I can't
> seem to see a need for IRQs to be off. I think it's really just the
> pre_flush() / post_flush() pair which needs to be inside such a
> section. I'll prepare a patch (for after 4.11). I think some of the
> changes later in your series will actually further ease this.
> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>>  v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> +if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> + !is_pv_32bit_vcpu(v) )
>> +get_cpu_info()->root_pgt_changed = true;
>>  }
> 
> As this doesn't actually update CR3, setting the flag shouldn't
> generally be necessary if the caller then invokes write_ptbase().
> Isn't setting the flag here needed solely in the case of
> _toggle_guest_pt() being up the call tree? In which case it would
> perhaps better be set there (and in turn some or even all of the
> conditional around it could be dropped)?

Yes, you are right.

> 
>>  void write_ptbase(struct vcpu *v)
>>  {
>> +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +get_cpu_info()->root_pgt_changed = true;
>>  write_cr3(v->arch.cr3);
> 
> When you come here from e.g. __sync_local_execstate(), you
> don't really need to set the flag. Of course you'll come here again
> before the next 64-bit PV vCPU will make it to restore_all_guest,
> so by the time we make it there the flag will be set anyway.
> However, if you already use such a subtlety, then there's also
> no point excluding 32-bit vCPU-s here (nor in make_cr3()), as
> those will never make it to restore_all_guest. Same then for
> excluding HVM vCPU-s. And I then wonder whether (here or
> more likely in a later patch) the root_pgt check couldn't go away
> as well.

I'm not sure this is worth it. Patch 3 will re-introduce a conditional
here and it will look rather different (e.g. without the root_pgt
check). So micro-optimizing this patch barely makes any sense.

> 
>> @@ -3698,18 +3703,29 @@ long do_mmu_update(
>>  break;
>>  rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> -/*
>> - * No need to sync if all uses of the page can be 
>> accounted
>> - * to the page lock we hold, its pinned status, and 
>> uses on
>> - * this (v)CPU.
>> - */
>> -if ( !rc && !cpu_has_no_xpti &&
>> - ((page->u.inuse.type_info & PGT_count_mask) >
>> -  (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>> -   (pagetable_get_pfn(curr->arch.guest_table) == 
>> mfn) 
>> +
>> -   (pagetable_get_pfn(curr->arch.guest_table_user) 
>> ==
>> -mfn))) )
>> -sync_guest = true;
>> +if ( !rc && !cpu_has_no_xpti )
>> +{
>> +bool local_in_use = false;
>> +
>> +if ( (pagetable_get_pfn(curr->arch.guest_table) ==
>> +  mfn) ||
>> + 
>> (pagetable_get_pfn(curr->arch.guest_table_user) ==
>> +  mfn) )
>> +{
>> +local_in_use = true;
>> +get_cpu_info()->root_pgt_changed = true;
>> +}
> 
> The conditional causes root_pgt_changed to get set even in cases
> where what CR3 points to doesn't actually change (if it's the user
> page tables that get modified). I think you want to check
> curr->arch.cr3 here, or only curr->arch.guest_table (as user mode
> can't invoke hypercalls).

I'll go with curr->arch.guest_table.

> 
>> +/*
>> + * No need to sync if all uses of the page can be
>> + * accounted to the page lock we hold, its pinned
>> + * status, and uses on this (v)CPU.
>> + */
>> +

Re: [Xen-devel] [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB

2018-03-22 Thread Juergen Gross
On 22/03/18 16:35, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -1380,6 +1380,14 @@ Because responsibility for APIC setup is shared 
>> between Xen and the
>>  domain 0 kernel this option is automatically propagated to the domain
>>  0 command line.
>>  
>> +### noinvpcid (x86)
>> +> `= `
>> +
>> +Disable using the INVPCID instruction for flushing TLB entries.
>> +This should only be used in case of known issues on the current platform
>> +with that instruction. Disabling INVPCID will normally result in a slightly
>> +degraded performance.
> 
> At the first glance this looks as if it wants to be a cpuid=
> sub-option. However, that would disable use by both Xen and
> (HVM) guests. Andrew, what are your plans here as to
> distinguishing the "Xen uses a feature" from the "disable use of
> a feature altogether"?
> 
> If we stay with a separate option, then please make this a
> normal boolean one (i.e. drop the "no" prefix), as "no-noinvpcid"
> is rather ugly.

Okay.

> 
>> @@ -457,7 +472,6 @@ static void generic_set_all(void)
>>  set_bit(count, _changes_mask);
>>  mask >>= 1;
>>  }
>> -
>>  }
>>  
>>  static void generic_set_mtrr(unsigned int reg, unsigned long base,
> 
> I don't mind this line being dropped, but in general please avoid
> stray changes which aren't assimilated into changes you do anyway.

The main reason I did drop this line was the trailing tab. I just took
the risk of someone complaining.


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-22 Thread Juergen Gross
On 22/03/18 17:30, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
> 
> I continue to be not entirely convinced of this move. I had an
> alternative in mind: Since retaining global pages is particularly
> relevant for switches between guest user and guest kernel
> modes, what if we made a shortcut from e.g. lstar_enter through
> switch_to_kernel to restore_all_guest without ever switching to
> the full page Xen tables?

With patch 7 of this series in mind I'm not convinced the extra effort
is really making sense. Today most processors do have PCID support so
for that old hardware I don't think we need to make the handling even
more complex.

> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>  struct cpu_info *cpu_info = get_cpu_info();
>> +unsigned long new_cr4;
>> +
>> +new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>> +  ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
> 
> I'm not overly happy to see any new uses of mmu_cr4_features.
> This should really only be used for priming certain values imo,
> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
> so too, and perhaps better wouldn't). Hence I wonder whether
> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
> because we've just got rid of the blanket reversion to
> mmu_cr4_features in VMX code.

I do understand that using mmu_cr4_features isn't the best way to set
cr4. But I think it is a good idea to have a default value which should
normally be used instead of only switching various bits on and off.

In case cr4 is loaded with a strange value in some corner case that
value might be used from then on instead of being repaired by loading a
dedicated value at certain points in time, e.g. when doing a context
switch.

So maybe we should introduce cr4_default which is derived from
mmu_cr4_features? mmu_cr4_features would contain all bits which are
allowed on the current processor with the current command line options,
while cr4_default would be a subset of mmu_cr4_features.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 3/7] xen/x86: support per-domain flag for xpti

2018-03-22 Thread Juergen Gross
On 22/03/18 16:44, Jan Beulich wrote:
 On 22.03.18 at 16:29,  wrote:
>> On 22/03/18 16:26, Jan Beulich wrote:
>> On 21.03.18 at 13:51,  wrote:
 +void xpti_domain_init(struct domain *d)
 +{
 +if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
 +return;
>>>
>>> As you rely on the zero-initialization of the field here, ...
>>>
 +switch ( opt_xpti )
 +{
 +case XPTI_OFF:
 +d->arch.pv_domain.xpti = false;
>>>
>>> ... this could go away as well.
>>
>> I wanted to make the switch statement complete. No problem to drop
>> setting of xpti here of you like that better.
> 
> FAOD I didn't mean dropping the entire case block.

Of course not. This would just be wrong with the current default block.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 1/4] x86/PVHv2: Add memory map pointer to hvm_start_info struct

2018-03-21 Thread Juergen Gross
On 21/03/18 10:28, Roger Pau Monné wrote:
> On Tue, Mar 20, 2018 at 09:48:56AM -0700, Maran Wilson wrote:
>> The start info structure that is defined as part of the x86/HVM direct boot
>> ABI and used for starting Xen PVH guests would be more versatile if it also
>> included a way to pass information about the memory map to the guest. This
>> would allow KVM guests to share the same entry point.
>>
>> Signed-off-by: Maran Wilson 
> 
> Reviewed-by: Roger Pau Monné 
> 
> Just a couple of nit suggestions...
> 
>> ---
>>  xen/include/public/arch-x86/hvm/start_info.h | 65 
>> +++-
>>  1 file changed, 64 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/include/public/arch-x86/hvm/start_info.h 
>> b/xen/include/public/arch-x86/hvm/start_info.h
>> index 6484159..d491f2d 100644
>> --- a/xen/include/public/arch-x86/hvm/start_info.h
>> +++ b/xen/include/public/arch-x86/hvm/start_info.h
>> @@ -33,7 +33,7 @@
>>   *| magic  | Contains the magic value XEN_HVM_START_MAGIC_VALUE
>>   *|| ("xEn3" with the 0x80 bit of the "E" set).
>>   *  4 ++
>> - *| version| Version of this structure. Current version is 0. 
>> New
>> + *| version| Version of this structure. Current version is 1. 
>> New
>>   *|| versions are guaranteed to be backwards-compatible.
>>   *  8 ++
>>   *| flags  | SIF_xxx flags.
>> @@ -48,6 +48,15 @@
>>   * 32 ++
>>   *| rsdp_paddr | Physical address of the RSDP ACPI data structure.
>>   * 40 ++
>> + *| memmap_paddr   | Physical address of the (optional) memory map. Only
>> + *|| present in version 1 and newer of the structure.
>> + * 48 ++
>> + *| memmap_entries | Number of entries in the memory map table. Zero
>> + *|| if there is no memory map being provided. Only
>> + *|| present in version 1 and newer of the structure.
>> + * 52 ++
>> + *| reserved   | Version 1 and newer only.
>> + * 56 ++
>>   *
>>   * The layout of each entry in the module structure is the following:
>>   *
>> @@ -62,14 +71,53 @@
>>   *| reserved   |
>>   * 32 ++
>>   *
>> + * The layout of each entry in the memory map table is as follows:
>> + *
>> + *  0 ++
>> + *| addr   | Base address
>> + *  8 ++
>> + *| size   | Size of mapping in bytes
>> + * 16 ++
>> + *| type   | Type of mapping as defined between the hypervisor
>> + *|| and guest it's starting. See XEN_HVM_MEMMAP_TYPE_*
> 
> I would remove "it's starting" here.
> 
>> + *|| values below.
>> + * 20 +|
>> + *| reserved   |
>> + * 24 ++
>> + *
>>   * The address and sizes are always a 64bit little endian unsigned integer.
>>   *
>>   * NB: Xen on x86 will always try to place all the data below the 4GiB
>>   * boundary.
>> + *
>> + * Version numbers of the hvm_start_info structure have evolved like this:
>> + *
>> + * Version 0:  Initial implementation.
>> + *
>> + * Version 1:  Added the memmap_paddr/memmap_entries fields (plus 4 bytes of
>> + * padding) to the end of the hvm_start_info struct. These new
>> + * fields can be used to pass a memory map to the guest. The
>> + * memory map is optional and so guests that understand version 
>> 1
>> + * of the structure must check that memmap_entries is non-zero
>> + * before trying to read the memory map.
>>   */
>>  #define XEN_HVM_START_MAGIC_VALUE 0x336ec578
>>  
>>  /*
>> + * The values used in the type field of the memory map table entries are
>> + * defined below and match the Address Range Types as defined in the "System
>> + * Address Map Interfaces" section of the ACPI Specification. Please refer 
>> to
>> + * section 15 in version 6.2 of the ACPI spec: 
>> http://uefi.org/specifications
>> + */
>> +#define XEN_HVM_MEMMAP_TYPE_RAM   1
>> +#define XEN_HVM_MEMMAP_TYPE_RESERVED  2
>> +#define XEN_HVM_MEMMAP_TYPE_ACPI  3
>> +#define XEN_HVM_MEMMAP_TYPE_NVS   4
>> +#define XEN_HVM_MEMMAP_TYPE_UNUSABLE  5
>> +#define XEN_HVM_MEMMAP_TYPE_DISABLED  6
>> +#define XEN_HVM_MEMMAP_TYPE_PMEM  7
>> +
>> +/*
>>   * C representation of the x86/HVM start info layout.
>>   *
>>   * The canonical definition of this layout is above, this is just a way to
>> @@ -86,6 +134,14 @@ struct hvm_start_info {
>>  uint64_t cmdline_paddr; /* Physical address of the command line.
>>  */
>>  uint64_t rsdp_paddr;/* Physical address of the RSDP ACPI data   
>>  */
>>  /* structure.   
>>  */
>> +uint64_t memmap_paddr;  /* Physical address of an array of   

Re: [Xen-devel] [PATCH v4 7/8] x86: also NOP out xen_cr3 restores of XPTI

2018-03-21 Thread Juergen Gross
On 19/03/18 14:41, Jan Beulich wrote:
> ... despite quite likely the gain being rather limited.
> 
> Signed-off-by: Jan Beulich <jbeul...@suse.com>

Tested-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 4/8] x86/XPTI: use %r12 to write zero into xen_cr3

2018-03-21 Thread Juergen Gross
On 19/03/18 14:39, Jan Beulich wrote:
> Now that we zero all registers early on all entry paths, use that to
> avoid a couple of immediates here.
> 
> Signed-off-by: Jan Beulich <jbeul...@suse.com>
> Acked-by: Andrew Cooper <andrew.coop...@citrix.com>

Tested-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 5/8] x86/XPTI: reduce .text.entry

2018-03-21 Thread Juergen Gross
On 19/03/18 14:40, Jan Beulich wrote:
> This exposes less code pieces and at the same time reduces the range
> covered from slightly above 3 pages to a little below 2 of them.
> 
> The code being moved is unchanged, except for the removal of trailing
> blanks, insertion of blanks between operands, and a pointless q suffix
> from "retq".
> 
> A few more small pieces could be moved, but it seems better to me to
> leave them where they are to not make it overly hard to follow code
> paths.
> 
> Signed-off-by: Jan Beulich <jbeul...@suse.com>

Tested-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 04/20] xen/domctl: Drop vcpu_alloc_lock

2018-03-20 Thread Juergen Gross
On 20/03/18 18:22, Andrew Cooper wrote:
> On 20/03/18 16:58, Jan Beulich wrote:
> On 19.03.18 at 20:13,  wrote:
>>> It is not entirely clear why this interlock was introduced in c/s 8cbb5278e
>>> "x86/AMD: Add support for AMD's OSVW feature in guests".
>>>
>>> At the time, svm_handle_osvw() could have seen an unexpected change in OSVW
>>> (not the case now, due to the new CPUID Policy infrastructure), but even 
>>> then,
>>> it would have caused spurious changes in behaviour when handling
>>> OSVW_{ID_LENGTH,STATUS} read requests on behalf of an already-running guest.
>>>
>>> There are plenty of other aspects of domain creation which depend on 
>>> hardware
>>> details which may change across a microcode load, but where not protected by
>>> this interlock.
>> Are there? We don't re-read CPUID (yet), for example. But of
>> course it is also not really specified which aspects may change
>> across microcode updates.
>>
>>> A host administrator choosing to perform late microcode loading has plenty 
>>> of
>>> other problems to worry about, and is it not unreasonable to expect them to
>>> temporarily cease domain construction activities while the microcode loading
>>> is in progress.
>> But it is also not unreasonable to expect the hypervisor to guard
>> against inconsistencies here. On the whole I'm not really
>> convinced; I think I'd like to hear others' opinions.
> 
> The underlying problem is that this lock cannot say when merging
> max_cpus into createdomain, because we cannot continue the hypercall
> midway through.
> 
> As it doesn't currently protect createdomain, which amongst other things
> contains init_domain_cpuid_policy() and init_domain_msr_policy() (the
> most likely structures to be affected by microcode updates), I don't see
> any purpose in keeping it for the minute area it does cover.

What about failing domain creation e.g. via -EAGAIN in case a
microcode update happened in between? This would be easy by adding a
microcode generation count which would have to be the same for start
and end of the create domain hypercall.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 1/8] x86: NOP out XPTI entry/exit code when it's not in use

2018-03-21 Thread Juergen Gross
On 19/03/18 14:37, Jan Beulich wrote:
> Introduce a synthetic feature flag to use alternative instruction
> patching to NOP out all code on entry/exit paths. Having NOPs here is
> generally better than using conditional branches.
> 
> Also change the limit on the number of bytes we can patch in one go to
> that resulting from the encoding in struct alt_instr - there's no point
> reducing it below that limit, and without a check being in place that
> the limit isn't actually exceeded, such an artificial boundary is a
> latent risk.
> 
> Signed-off-by: Jan Beulich <jbeul...@suse.com>

Tested-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible

2018-03-21 Thread Juergen Gross
For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag whether the copying should be performed and set
that flag only when loading a new %cr3 or modifying the L4 page table.
This includes synchronization of the cpu local root page table with
other cpus, so add a special synchronization flag for that case.

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- set flag locally only if affected L4 is active (Jan Beulich)
- add setting flag to flush_area_mask() (Jan Beulich)
- set flag in make_cr3() only if called for current active vcpu

To be applied on top of Jan's "Meltdown band-aid overhead reduction"
series
---
 xen/arch/x86/flushtlb.c   |  3 +++
 xen/arch/x86/mm.c | 42 +++
 xen/arch/x86/smp.c|  2 +-
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/arch/x86/x86_64/entry.S   |  8 ++--
 xen/include/asm-x86/current.h |  8 
 xen/include/asm-x86/flushtlb.h|  2 ++
 7 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 8a7a76b8ff..9a9af71d5a 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned int 
flags)
 }
 }
 
+if ( flags & FLUSH_ROOT_PGTBL )
+get_cpu_info()->root_pgt_changed = true;
+
 local_irq_restore(irqfl);
 
 return flags;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f0195561c2..352600ad73 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
 v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
+if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
+ !is_pv_32bit_vcpu(v) )
+get_cpu_info()->root_pgt_changed = true;
 }
 
 void write_ptbase(struct vcpu *v)
 {
+if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+get_cpu_info()->root_pgt_changed = true;
 write_cr3(v->arch.cr3);
 }
 
@@ -3698,18 +3703,29 @@ long do_mmu_update(
 break;
 rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
   cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-/*
- * No need to sync if all uses of the page can be accounted
- * to the page lock we hold, its pinned status, and uses on
- * this (v)CPU.
- */
-if ( !rc && !cpu_has_no_xpti &&
- ((page->u.inuse.type_info & PGT_count_mask) >
-  (1 + !!(page->u.inuse.type_info & PGT_pinned) +
-   (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
-   (pagetable_get_pfn(curr->arch.guest_table_user) ==
-mfn))) )
-sync_guest = true;
+if ( !rc && !cpu_has_no_xpti )
+{
+bool local_in_use = false;
+
+if ( (pagetable_get_pfn(curr->arch.guest_table) ==
+  mfn) ||
+ (pagetable_get_pfn(curr->arch.guest_table_user) ==
+  mfn) )
+{
+local_in_use = true;
+get_cpu_info()->root_pgt_changed = true;
+}
+
+/*
+ * No need to sync if all uses of the page can be
+ * accounted to the page lock we hold, its pinned
+ * status, and uses on this (v)CPU.
+ */
+if ( (page->u.inuse.type_info & PGT_count_mask) >
+ (1 + !!(page->u.inuse.type_info & PGT_pinned) +
+  local_in_use) )
+sync_guest = true;
+}
 break;
 
 case PGT_writable_page:
@@ -3824,7 +3840,7 @@ long do_mmu_update(
 
 cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
 if ( !cpumask_empty(m

[Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-21 Thread Juergen Gross
Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- support PCID for non-XPTI case, too
- add command line parameter for controlling usage of PCID
- check PCID active by using cr4.pcide (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 12 +
 xen/arch/x86/debug.c|  3 ++-
 xen/arch/x86/domain_page.c  |  2 +-
 xen/arch/x86/domctl.c   |  4 +++
 xen/arch/x86/flushtlb.c | 49 --
 xen/arch/x86/mm.c   | 34 +---
 xen/arch/x86/pv/dom0_build.c|  1 +
 xen/arch/x86/pv/domain.c| 52 +
 xen/include/asm-x86/domain.h| 14 +++---
 xen/include/asm-x86/pv/domain.h |  2 ++
 xen/include/asm-x86/x86-defns.h |  1 +
 11 files changed, 158 insertions(+), 16 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index 8fc7b2ff3b..4ecf471ea9 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1450,6 +1450,18 @@ All numbers specified must be hexadecimal ones.
 
 This option can be specified more than once (up to 8 times at present).
 
+### pcid (x86)
+> `= off | all | xpti | noxpti`
+
+> Default: `xpti`
+
+> Can be modified at runtime
+
+If available, control usage of the PCID feature of the processor for
+64-bit pv-domains. PCID can be used either for no domain at all, for
+all of them, only for those subject to XPTI or for those not subject
+to XPTI.
+
 ### ple\_gap
 > `= `
 
diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index 9159f32db4..c8079569c4 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t 
pgd3val)
 l3_pgentry_t l3e, *l3t;
 l2_pgentry_t l2e, *l2t;
 l1_pgentry_t l1e, *l1t;
-unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
+unsigned long cr3 = (pgd3val ? pgd3val
+ : (dp->vcpu[0]->arch.cr3 & ~X86_CR3_NOFLUSH));
 mfn_t mfn = maddr_to_mfn(cr3);
 
 DBGP2("vaddr:%lx domid:%d cr3:%lx pgd3:%lx\n", vaddr, dp->domain_id, 
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index b5780f201f..8073ae5282 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
 if ( (v = idle_vcpu[smp_processor_id()]) == current )
 sync_local_execstate();
 /* We must now be running on the idle page table. */
-ASSERT(read_cr3() == __pa(idle_pg_table));
+ASSERT((read_cr3() & ~X86_CR3_PCIDMASK) == __pa(idle_pg_table));
 }
 
 return v;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 0704f398c7..a7c8772fa6 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -613,7 +613,11 @@ long arch_do_domctl(
 ret = -EINVAL;
 
 if ( ret == 0 )
+{
 xpti_domain_init(d);
+pcid_domain_init(d);
+}
+
 break;
 
 case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index d4b8acc837..092ef86314 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -102,7 +102,19 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
 t = pre_flush();
 
 if ( read_cr4() & X86_CR4_PGE )
+/*
+ * X86_CR4_PGE set means PCID being inactive.
+ * We have to purge the TLB via flipping cr4.pge.
+ */
 write_cr4(cr4 & ~X86_CR4_PGE);
+else if ( cpu_has_invpcid )
+/*
+ * If we are using PCID purge the TLB via INVPCID as loading cr3
+ * will affect the current PCID only.
+ * If INVPCID is not supported we don't use PCIDs so loading cr3
+ * will purge the TLB (we are in the "global pages off" branch).
+ */
+invpcid_flush_all();
 
 asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
 
@@ -131,14 +143,35 @@ unsigned int flush_area_local(const void *va, unsigned 
int flags)
 {
 if ( order == 0 )

[Xen-devel] [PATCH v3 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid

2018-03-21 Thread Juergen Gross
Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.

Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- renamed use_xen_cr3 to better fitting use_pv_cr3
- corrected comment regarding semantics of use_pv_cr3 (Jan Beulich)
- prefer 32-bit operations over 8- or 16-bit ones (Jan Beulich)
---
 xen/arch/x86/domain.c  |  1 +
 xen/arch/x86/mm.c  |  3 +-
 xen/arch/x86/smpboot.c |  2 ++
 xen/arch/x86/x86_64/asm-offsets.c  |  1 +
 xen/arch/x86/x86_64/compat/entry.S |  5 ++-
 xen/arch/x86/x86_64/entry.S| 67 +-
 xen/include/asm-x86/current.h  | 12 ---
 7 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index fe18d9d3c4..6f3e29c677 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1693,6 +1693,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
 ASSERT(local_irq_is_enabled());
 
+get_cpu_info()->use_pv_cr3 = false;
 get_cpu_info()->xen_cr3 = 0;
 
 if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c2738dd3fd..29071bb257 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -521,7 +521,8 @@ void write_ptbase(struct vcpu *v)
 }
 else
 {
-/* Make sure to clear xen_cr3 before pv_cr3. */
+/* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */
+cpu_info->use_pv_cr3 = false;
 cpu_info->xen_cr3 = 0;
 /* write_cr3_cr4() serializes. */
 write_cr3_cr4(v->arch.cr3, new_cr4);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 07624da32b..8fa24cead8 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -330,6 +330,7 @@ void start_secondary(void *unused)
  */
 spin_debug_disable();
 
+get_cpu_info()->use_pv_cr3 = false;
 get_cpu_info()->xen_cr3 = 0;
 get_cpu_info()->pv_cr3 = 0;
 
@@ -1128,6 +1129,7 @@ void __init smp_prepare_boot_cpu(void)
 per_cpu(scratch_cpumask, cpu) = _cpu0mask;
 #endif
 
+get_cpu_info()->use_pv_cr3 = false;
 get_cpu_info()->xen_cr3 = 0;
 get_cpu_info()->pv_cr3 = 0;
 }
diff --git a/xen/arch/x86/x86_64/asm-offsets.c 
b/xen/arch/x86/x86_64/asm-offsets.c
index 9e2aefb00f..7ad024cf37 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -144,6 +144,7 @@ void __dummy__(void)
 OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, 
use_shadow_spec_ctrl);
 OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
 OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
+OFFSET(CPUINFO_use_pv_cr3, struct cpu_info, use_pv_cr3);
 DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
 BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S 
b/xen/arch/x86/x86_64/compat/entry.S
index 6d2a14eacf..a18598f106 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -211,10 +211,9 @@ ENTRY(cstar_enter)
 GET_STACK_END(bx)
 .Lcstar_cr3_start:
 mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-neg   %rcx
+test  %rcx, %rcx
 jz.Lcstar_cr3_okay
-mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-neg   %rcx
+movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
 mov   %rcx, %cr3
 /* %r12 is still zero at this point. */
 mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index eb33ec835a..f51d091c23 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -155,6 +155,7 @@ restore_all_guest:
 rep movsq
 .Lrag_copy_done:
 mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+movb  $1, STACK_CPUINFO_FIELD(use_pv_cr3)(%rdx)
 mov   %rax, %cr3
 .Lrag_cr3_end:
 ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
@@ -205,14 +206,9 @@ restore_all_xen:
  */
 GET_STACK_END(bx)
 .Lrax_cr3_start:
-mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+cmpb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
+UNLIKELY_START(ne, exit_cr3)
 mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
-test  %rdx, %rdx
-/*
- * Ideally the condition would be "nsz", but such doesn't exist,
- * so "g" w

[Xen-devel] [PATCH v3 0/7] xen/x86: various XPTI speedups

2018-03-21 Thread Juergen Gross
This patch series aims at reducing the overhead of the XPTI Meltdown
mitigation. It is based on Jan's XPTI speedup series.

Patch 1 had been posted before, the main changes in this patch are due
to addressing Jan's comments on my first version. The main objective of
that patch is to avoid copying the L4 page table each time the guest is
being activated, as often the contents didn't change while the
hypervisor was active.

Patch 2 tries to minimize flushing the TLB: there is no need to flush
it in write_ptbase() and when activating the guest.

Patch 3 sets the stage for being able to activate XPTI per domain. As a
first step it is now possible to switch XPTI off for dom0 via the xpti
boot parameter.

Patch 4 adds support for using the INVPCID instruction for flushing
the TLB.

Patch 5 reduces the costs of TLB flushes even further: as we don't make
any use of global TLB entries with XPTI being active we can avoid
removing all global TLB entries on TLB flushes by simply deactivating
the global pages in CR4.

Patch 6 was originally only meant to prepare using PCIDs in patch 6.
For that purpose it was necessary to allow CR3 values with bit 63 set
in order to avoid flushing TLB entries when writing CR3. This requires
a modification of Jan's rather clever state machine with positive and
negative CR3 values for the hypervisor by using a dedicated flag byte
instead. It turned out this modification saved one branch on interrupt
entry speeding up the handling by a few percent.

Patch 7 is the main performance contributor: by making use of the PCID
feature (if available) TLB entries can survive CR3 switches. The TLB
needs to be flushed on context switches only and not when switching
between guest and hypervisor or guest kernel and user mode.

On my machine (Intel i7-4600M) using the PCID feature in the non-XPTI
case showed a slightly worse performance than using global pages
instead (using PCID and global pages is a bad idea as invalidating
global pages in this case would need a complete TLB flush). For this
reason I've decided to use PCID for XPTI only as the default. That
can easily be changed by using the command line parameter "pcid=all".

The complete series has been verified to still mitigate against
Meltdown attacks. A simple performance test (make -j 4 in the Xen
hypervisor directory) showed significant improvements compared to the
state without this series (so with Jan's series applied),
the percentage after the numbers is always related to XPTI off:

   XPTI off Jan, XPTI on+this series, XPTI on
real   1m21.169s1m52.149s (+38%)1m25.692s (+6%)
user   2m47.652s2m50.054s (+1%) 2m46.428s (-1%)
sys1m11.949s2m21.767s (+97%)1m23.053s (+15%)

A git branch of that series (+ Jan's patches) is available:

https://github.com/jgross1/xen.git xpti


Juergen Gross (7):
  x86/xpti: avoid copying L4 page table contents when possible
  x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  xen/x86: support per-domain flag for xpti
  xen/x86: use invpcid for flushing the TLB
  xen/x86: disable global pages for domains with XPTI active
  xen/x86: use flag byte for decision whether xen_cr3 is valid
  xen/x86: use PCID feature

 docs/misc/xen-command-line.markdown |  30 +++-
 xen/arch/x86/cpu/mtrr/generic.c |  37 +++---
 xen/arch/x86/debug.c|   3 +-
 xen/arch/x86/domain.c   |   6 +-
 xen/arch/x86/domain_page.c  |   2 +-
 xen/arch/x86/domctl.c   |   8 +++
 xen/arch/x86/flushtlb.c |  96 +++---
 xen/arch/x86/mm.c   |  91 +
 xen/arch/x86/pv/dom0_build.c|   4 ++
 xen/arch/x86/pv/domain.c| 132 +++-
 xen/arch/x86/setup.c|  27 +++-
 xen/arch/x86/smp.c  |   2 +-
 xen/arch/x86/smpboot.c  |   7 +-
 xen/arch/x86/x86_64/asm-offsets.c   |   2 +
 xen/arch/x86/x86_64/compat/entry.S  |   5 +-
 xen/arch/x86/x86_64/entry.S |  87 ++--
 xen/common/efi/runtime.c|   4 +-
 xen/include/asm-x86/current.h   |  23 +--
 xen/include/asm-x86/domain.h|  20 --
 xen/include/asm-x86/flushtlb.h  |   4 +-
 xen/include/asm-x86/pv/domain.h |   7 +-
 xen/include/asm-x86/x86-defns.h |   1 +
 22 files changed, 453 insertions(+), 145 deletions(-)

-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v3 3/7] xen/x86: support per-domain flag for xpti

2018-03-21 Thread Juergen Gross
Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows to modify the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=nodom0" as boot
parameter will achieve that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- latch get_cpu_info() return value in variable (Jan Beulich)
- call always xpti_domain_init() for pv dom0 (Jan Beulich)
- add __init annotations (Jan Beulich)
- drop per domain XPTI message (Jan Beulich)
- document xpti=default support (Jan Beulich)
- move domain xpti flag into a padding hole (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 10 -
 xen/arch/x86/domctl.c   |  4 ++
 xen/arch/x86/mm.c   | 10 -
 xen/arch/x86/pv/dom0_build.c|  3 ++
 xen/arch/x86/pv/domain.c| 80 -
 xen/arch/x86/setup.c| 20 +-
 xen/arch/x86/smpboot.c  |  5 +--
 xen/arch/x86/x86_64/entry.S |  2 +
 xen/include/asm-x86/current.h   |  3 +-
 xen/include/asm-x86/domain.h|  3 ++
 xen/include/asm-x86/pv/domain.h |  5 ++-
 11 files changed, 118 insertions(+), 27 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index b353352adf..79be9a6ba5 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1955,7 +1955,7 @@ clustered mode.  The default, given no hint from the 
**FADT**, is cluster
 mode.
 
 ### xpti
-> `= `
+> `= default | nodom0 | `
 
 > Default: `false` on AMD hardware
 > Default: `true` everywhere else
@@ -1963,6 +1963,14 @@ mode.
 Override default selection of whether to isolate 64-bit PV guest page
 tables.
 
+`true` activates page table isolation even on AMD hardware.
+
+`false` deactivates page table isolation on all systems.
+
+`default` sets the default behaviour.
+
+`nodom0` deactivates page table isolation for dom0.
+
 ### xsave
 > `= `
 
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 8fbbf3aeb3..0704f398c7 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include  /* for hvm_acpi_power_button */
 #include  /* for arch_do_domctl */
 #include 
@@ -610,6 +611,9 @@ long arch_do_domctl(
 ret = switch_compat(d);
 else
 ret = -EINVAL;
+
+if ( ret == 0 )
+xpti_domain_init(d);
 break;
 
 case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 8c944b33c9..e68ed474bf 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -507,14 +507,20 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+struct cpu_info *cpu_info = get_cpu_info();
+
+if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
 {
-get_cpu_info()->root_pgt_changed = true;
+cpu_info->root_pgt_changed = true;
+cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
 asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
 }
 else
 {
+/* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+cpu_info->xen_cr3 = 0;
 write_cr3(v->arch.cr3);
+cpu_info->pv_cr3 = 0;
 }
 }
 
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 0bd2f1bf90..77186c19bd 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -707,6 +708,8 @@ int __init dom0_construct_pv(struct domain *d,
 cpu = p->processor;
 }
 
+xpti_domain_init(d);
+
 d->arch.paging.mode = 0;
 
 /* Set up CR3 value for write_ptbase */
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2524326b74..266117e804 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -9,6 +9,8 @@
 #include 
 #include 
 
+#include 
+#include 
 #include 
 
 /* Override macros from asm/page.h to make them work with mfn_t */
@@ -17,6 +19,82 @@
 #undef page_to_mfn
 #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
 
+static __read_mostly enum {
+XPTI_DEFAULT,
+XPTI_ON,
+XPTI_OFF,
+XPTI_NODOM0
+} opt_xpti = XPTI_DEFAULT;
+
+static __init int parse_xpti(const char *s)
+{
+int rc = 0;
+
+switch ( parse_bool(s, NULL) )
+{
+case 0:
+opt_xpti = XPTI_OFF;
+break;
+case 1:
+opt_xpti = XPTI_ON;
+break;
+default:
+if ( !strcmp(s, "default") )
+opt_xpti = XPTI_DEFAULT;
+else if ( !strcmp(s, "nodom0") )
+  

[Xen-devel] [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB

2018-03-21 Thread Juergen Gross
If possible use the INVPCID instruction for flushing the TLB instead of
toggling cr4.pge for that purpose.

While at it remove the dependency on cr4.pge being required for mtrr
loading, as this will be required later anyway.

Add a command line option "noinvpcid" for deactivating the use of
INVPCID.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- new patch
---
 docs/misc/xen-command-line.markdown |  8 
 xen/arch/x86/cpu/mtrr/generic.c | 37 ++---
 xen/arch/x86/flushtlb.c | 31 +--
 xen/arch/x86/setup.c|  7 +++
 4 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown 
b/docs/misc/xen-command-line.markdown
index 79be9a6ba5..8fc7b2ff3b 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1380,6 +1380,14 @@ Because responsibility for APIC setup is shared between 
Xen and the
 domain 0 kernel this option is automatically propagated to the domain
 0 command line.
 
+### noinvpcid (x86)
+> `= `
+
+Disable using the INVPCID instruction for flushing TLB entries.
+This should only be used in case of known issues on the current platform
+with that instruction. Disabling INVPCID will normally result in a slightly
+degraded performance.
+
 ### noirqbalance
 > `= `
 
diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
index e9c0e5e059..e88643f4bf 100644
--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -400,8 +401,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
  * has been called.
  */
 
-static void prepare_set(void)
+static bool prepare_set(void)
 {
+   unsigned long cr4;
+
/*  Note that this is not ideal, since the cache is only 
flushed/disabled
   for this CPU while the MTRRs are changed, but changing this requires
   more invasive changes to the way the kernel boots  */
@@ -412,18 +415,24 @@ static void prepare_set(void)
write_cr0(read_cr0() | X86_CR0_CD);
wbinvd();
 
-   /*  TLB flushing here relies on Xen always using CR4.PGE. */
-   BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
-   write_cr4(read_cr4() & ~X86_CR4_PGE);
+   cr4 = read_cr4();
+   if (cr4 & X86_CR4_PGE)
+   write_cr4(cr4 & ~X86_CR4_PGE);
+   else if (cpu_has_invpcid)
+   invpcid_flush_all();
+   else
+   asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
/*  Save MTRR state */
rdmsrl(MSR_MTRRdefType, deftype);
 
/*  Disable MTRRs, and set the default type to uncached  */
mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
+
+   return cr4 & X86_CR4_PGE;
 }
 
-static void post_set(void)
+static void post_set(bool pge)
 {
/* Intel (P6) standard MTRRs */
mtrr_wrmsr(MSR_MTRRdefType, deftype);
@@ -432,7 +441,12 @@ static void post_set(void)
write_cr0(read_cr0() & ~X86_CR0_CD);
 
/*  Reenable CR4.PGE (also flushes the TLB) */
-   write_cr4(read_cr4() | X86_CR4_PGE);
+   if (pge)
+   write_cr4(read_cr4() | X86_CR4_PGE);
+   else if (cpu_has_invpcid)
+   invpcid_flush_all();
+   else
+   asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
spin_unlock(_atomicity_lock);
 }
@@ -441,14 +455,15 @@ static void generic_set_all(void)
 {
unsigned long mask, count;
unsigned long flags;
+   bool pge;
 
local_irq_save(flags);
-   prepare_set();
+   pge = prepare_set();
 
/* Actually set the state */
mask = set_mtrr_state();
 
-   post_set();
+   post_set(pge);
local_irq_restore(flags);
 
/*  Use the atomic bitops to update the global mask  */
@@ -457,7 +472,6 @@ static void generic_set_all(void)
set_bit(count, _changes_mask);
mask >>= 1;
}
-   
 }
 
 static void generic_set_mtrr(unsigned int reg, unsigned long base,
@@ -474,11 +488,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned 
long base,
 {
unsigned long flags;
struct mtrr_var_range *vr;
+   bool pge;
 
vr = _state.var_ranges[reg];
 
local_irq_save(flags);
-   prepare_set();
+   pge = prepare_set();
 
if (size == 0) {
/* The invalid bit is kept in the mask, so we simply clear the
@@ -499,7 +514,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned 
long base,
mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
}
 
-   post_set();
+   post_set(pge);
local_irq_restore(flags);
 }
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/fl

[Xen-devel] [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context

2018-03-21 Thread Juergen Gross
When switching to a 64-bit pv context the TLB is flushed twice today:
the first time when switching to the new address space in
write_ptbase(), the second time when switching to guest mode in
restore_to_guest.

Avoid the first TLB flush in that case.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- omit setting root_pgt_changed to false (Jan Beulich)
---
 xen/arch/x86/mm.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 352600ad73..8c944b33c9 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -123,6 +123,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -507,8 +508,14 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 void write_ptbase(struct vcpu *v)
 {
 if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+{
 get_cpu_info()->root_pgt_changed = true;
-write_cr3(v->arch.cr3);
+asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+}
+else
+{
+write_cr3(v->arch.cr3);
+}
 }
 
 /*
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-21 Thread Juergen Gross
Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.

In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to write_cr3_cr4() called by write_ptbase()).

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
V3:
- move cr4 loading for all domains from *_ctxt_switch_to() to
  write_cr3_cr4() called by write_ptbase() (Jan Beulich)
- rebase
---
 xen/arch/x86/domain.c  |  5 -
 xen/arch/x86/flushtlb.c| 13 -
 xen/arch/x86/mm.c  | 11 ---
 xen/arch/x86/x86_64/entry.S| 10 --
 xen/common/efi/runtime.c   |  4 ++--
 xen/include/asm-x86/domain.h   |  3 ++-
 xen/include/asm-x86/flushtlb.h |  2 +-
 7 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 4cac8906ea..fe18d9d3c4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1520,17 +1520,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
 void paravirt_ctxt_switch_to(struct vcpu *v)
 {
 root_pgentry_t *root_pgt = this_cpu(root_pgt);
-unsigned long cr4;
 
 if ( root_pgt )
 root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
 l4e_from_page(v->domain->arch.perdomain_l3_pg,
   __PAGE_HYPERVISOR_RW);
 
-cr4 = pv_guest_cr4_to_real_cr4(v);
-if ( unlikely(cr4 != read_cr4()) )
-write_cr4(cr4);
-
 if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
 activate_debugregs(v);
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 6345676bc5..d4b8acc837 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -91,20 +91,23 @@ static void do_tlb_flush(void)
 post_flush(t);
 }
 
-void write_cr3(unsigned long cr3)
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
 {
-unsigned long flags, cr4;
+unsigned long flags;
 u32 t;
 
 /* This non-reentrant function is sometimes called in interrupt context. */
 local_irq_save(flags);
 
 t = pre_flush();
-cr4 = read_cr4();
 
-write_cr4(cr4 & ~X86_CR4_PGE);
+if ( read_cr4() & X86_CR4_PGE )
+write_cr4(cr4 & ~X86_CR4_PGE);
+
 asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
-write_cr4(cr4);
+
+if ( read_cr4() != cr4 )
+write_cr4(cr4);
 
 post_flush(t);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e68ed474bf..c2738dd3fd 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 void write_ptbase(struct vcpu *v)
 {
 struct cpu_info *cpu_info = get_cpu_info();
+unsigned long new_cr4;
+
+new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
+  ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
 
 if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
 {
 cpu_info->root_pgt_changed = true;
 cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
-asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+write_cr3_cr4(v->arch.cr3, new_cr4);
 }
 else
 {
-/* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+/* Make sure to clear xen_cr3 before pv_cr3. */
 cpu_info->xen_cr3 = 0;
-write_cr3(v->arch.cr3);
+/* write_cr3_cr4() serializes. */
+write_cr3_cr4(v->arch.cr3, new_cr4);
 cpu_info->pv_cr3 = 0;
 }
 }
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 2a06cd1a51..eb33ec835a 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -154,13 +154,8 @@ restore_all_guest:
 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
 rep movsq
 .Lrag_copy_done:
-mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
 mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
-mov   %rdi, %rsi
-and   $~X86_CR4_PGE, %rdi
-mov   %rdi, %cr4
 mov   %rax, %cr3
-mov   %rsi, %cr4
 .Lrag_cr3_end:
 ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
 
@@ -218,12 +213,7 @@ restore_all_xen:
  * so "g" will have to do.
  */
 UNLIKELY_START(g, exit_cr3)
-mov   %cr4, %rdi
-mov   %rdi, %rsi
-and   $~X86_CR4_PGE, %rdi
-mov   %rdi, %cr4
 mov   %rax, %cr3
-mov   %rsi, %cr4
 UNLIKELY_END(exit_cr3)
 .Lrax_cr3_end:
 ALTERNATIVE_NOP .Lrax_cr3_start, .Lrax_cr3_end, X86_FEATURE_NO_XPTI
diff --git a/xen/common/efi/runtime.c b/xe

Re: [Xen-devel] [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-23 Thread Juergen Gross
On 23/03/18 08:46, Jan Beulich wrote:
 On 22.03.18 at 19:18,  wrote:
>> On 22/03/18 17:30, Jan Beulich wrote:
>> On 21.03.18 at 13:51,  wrote:
 Instead of flushing the TLB from global pages when switching address
 spaces with XPTI being active just disable global pages via %cr4
 completely when a domain subject to XPTI is active. This avoids the
 need for extra TLB flushes as loading %cr3 will remove all TLB
 entries.
>>>
>>> I continue to be not entirely convinced of this move. I had an
>>> alternative in mind: Since retaining global pages is particularly
>>> relevant for switches between guest user and guest kernel
>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>> switch_to_kernel to restore_all_guest without ever switching to
>>> the full page Xen tables?
>>
>> With patch 7 of this series in mind I'm not convinced the extra effort
>> is really making sense. Today most processors do have PCID support so
>> for that old hardware I don't think we need to make the handling even
>> more complex.
> 
> PCID yes, but INVPCID (and we're talking about Intel hardware
> here only anyway)? But yes, the extra complexity is what has kept
> me so far from investing time here.

As PCID seems to speed up XPTI only and XPTI is necessary for INTEL cpus
only I don't see a problem here. :-)

> 
 --- a/xen/arch/x86/mm.c
 +++ b/xen/arch/x86/mm.c
 @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
  void write_ptbase(struct vcpu *v)
  {
  struct cpu_info *cpu_info = get_cpu_info();
 +unsigned long new_cr4;
 +
 +new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
 +  ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>
>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>> This should really only be used for priming certain values imo,
>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>> because we've just got rid of the blanket reversion to
>>> mmu_cr4_features in VMX code.
>>
>> I do understand that using mmu_cr4_features isn't the best way to set
>> cr4. But I think it is a good idea to have a default value which should
>> normally be used instead of only switching various bits on and off.
>>
>> In case cr4 is loaded with a strange value in some corner case that
>> value might be used from then on instead of being repaired by loading a
>> dedicated value at certain points in time, e.g. when doing a context
>> switch.
> 
> But that would make it even more difficult to notice and debug a
> possible problem. The more we play with CR4 bits, the more
> important it is that we keep an accurate record of what is currently
> loaded into it, and that we have a clear understanding of what we
> mean to be loaded into the register at any given point in time.

What about adding an appropriate ASSERT() for that case?

So I would use read_cr4() | X86_CR4_PGE and ASSERT() that cr4 matches
the default value.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH for-4.11] tools/xenstore: fix linking libxenstore with ldl

2018-03-23 Thread Juergen Gross
Commit 448c03b3cbe1487 ("tools/xenstore: try to get minimum thread
stack size for watch thread") added a dependency to libdl to
libxenstore. Unfortunately the way it was added requires now all
users of libxenstore to specify "-ldl" when linking. This can be
avoided by linking libxenstore.so specifying "-ldl" as a trailing
option. So use APPEND_LDFLAGS instead of LDFLAGS for adding the
"-ldl" option when linking libxenstore.so.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 tools/xenstore/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/xenstore/Makefile b/tools/xenstore/Makefile
index 69e55e73e5..445e9911b2 100644
--- a/tools/xenstore/Makefile
+++ b/tools/xenstore/Makefile
@@ -104,7 +104,7 @@ libxenstore.so.$(MAJOR): libxenstore.so.$(MAJOR).$(MINOR)
 xs.opic: CFLAGS += -DUSE_PTHREAD
 ifeq ($(CONFIG_Linux),y)
 xs.opic: CFLAGS += -DUSE_DLSYM
-libxenstore.so.$(MAJOR).$(MINOR): LDFLAGS += -ldl
+libxenstore.so.$(MAJOR).$(MINOR): APPEND_LDFLAGS += -ldl
 else
 PKG_CONFIG_REMOVE += -ldl
 endif
-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active

2018-03-23 Thread Juergen Gross
On 23/03/18 09:14, Jan Beulich wrote:
 On 23.03.18 at 08:58,  wrote:
>> On 23/03/18 08:46, Jan Beulich wrote:
>> On 22.03.18 at 19:18,  wrote:
 On 22/03/18 17:30, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
>
> I continue to be not entirely convinced of this move. I had an
> alternative in mind: Since retaining global pages is particularly
> relevant for switches between guest user and guest kernel
> modes, what if we made a shortcut from e.g. lstar_enter through
> switch_to_kernel to restore_all_guest without ever switching to
> the full page Xen tables?

 With patch 7 of this series in mind I'm not convinced the extra effort
 is really making sense. Today most processors do have PCID support so
 for that old hardware I don't think we need to make the handling even
 more complex.
>>>
>>> PCID yes, but INVPCID (and we're talking about Intel hardware
>>> here only anyway)? But yes, the extra complexity is what has kept
>>> me so far from investing time here.
>>
>> As PCID seems to speed up XPTI only and XPTI is necessary for INTEL cpus
>> only I don't see a problem here. :-)
> 
> Well, yes as far as AMD is unaffected, but not entirely yes to the
> rest, as I intentionally pointed out the difference of availability of
> PCID (which even Westmere's have) and INVPCID (which only my
> Haswell has).

Haswells are out for nearly 5 years now.

I think in case we want to do something else here we should delay that
to 4.12. Even without PCID this patch is speeding up XPTI handling
significantly (parallel hypervisor compilation: elapsed time -6%,
system time -12%), so I'm not making anything worse.

BTW: while Westmere is supporting PCID I remember having done some PCID
testing with Westmere not showing any performance gains at all.

> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>  struct cpu_info *cpu_info = get_cpu_info();
>> +unsigned long new_cr4;
>> +
>> +new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>> +  ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>
> I'm not overly happy to see any new uses of mmu_cr4_features.
> This should really only be used for priming certain values imo,
> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
> so too, and perhaps better wouldn't). Hence I wonder whether
> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
> because we've just got rid of the blanket reversion to
> mmu_cr4_features in VMX code.

 I do understand that using mmu_cr4_features isn't the best way to set
 cr4. But I think it is a good idea to have a default value which should
 normally be used instead of only switching various bits on and off.

 In case cr4 is loaded with a strange value in some corner case that
 value might be used from then on instead of being repaired by loading a
 dedicated value at certain points in time, e.g. when doing a context
 switch.
>>>
>>> But that would make it even more difficult to notice and debug a
>>> possible problem. The more we play with CR4 bits, the more
>>> important it is that we keep an accurate record of what is currently
>>> loaded into it, and that we have a clear understanding of what we
>>> mean to be loaded into the register at any given point in time.
>>
>> What about adding an appropriate ASSERT() for that case?
>>
>> So I would use read_cr4() | X86_CR4_PGE and ASSERT() that cr4 matches
>> the default value.
> 
> That's an option; later we may want to sprinkle around a few more
> such assertions.

Okay, I'll go that route then. Do you want me to use a new default_cr4
variable for doing the assertion (and as initial cr4 value for secondary
cpus), or are you fine with using mmu_cr4_features for now?


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v4 0/8] x86: Meltdown band-aid overhead reduction

2018-03-23 Thread Juergen Gross
On 19/03/18 14:32, Jan Beulich wrote:
> 1: NOP out most XPTI entry/exit code when it's not in use
> 2: disable XPTI when RDCL_NO
> 3: x86: log XPTI enabled status
> 4: use %r12 to write zero into xen_cr3
> 5: reduce .text.entry
> 6: enable interrupts earlier with XPTI disabled
> 7: also NOP out xen_cr3 restores of XPTI
> 8: avoid double CR3 reload when switching to guest user mode
> 
> Signed-off-by: Jan Beulich 
> ---
> v4: Main change is the split of patch 1.

Andrew, any chance you could review patches 1 and 5-8 in order to
unblock this series and mine?


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-23 Thread Juergen Gross
On 23/03/18 11:51, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> Avoid flushing the complete TLB when switching %cr3 for mitigation of
>> Meltdown by using the PCID feature if available.
>>
>> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
>> 2 values for the non-XPTI case:
>>
>> - guest active and in kernel mode
>> - guest active and in user mode
>> - hypervisor active and guest in user mode (XPTI only)
>> - hypervisor active and guest in kernel mode (XPTI only)
> 
> Before committing to this route, Jun, Kevin, can we please get
> confirmation that PCID isn't (and isn't going to be) subject to the
> same speculation issues in the pipeline that the U/S bit is (having
> caused Meltdown in the first place)? To me it seems a pretty
> likely thing to play all the same games.

Really? This would assume either the processor is capable to deal with
multiple matching TLB entries when speculating or that there can be
only one entry per virtual address present in the TLB at the same time
in spite of different PCIDs.

And why aren't you asking for the same problem with VPIDs? This should
be comparable to the PCID problem you are suspecting.

> 
>> ---
>>  docs/misc/xen-command-line.markdown | 12 +
>>  xen/arch/x86/debug.c|  3 ++-
>>  xen/arch/x86/domain_page.c  |  2 +-
>>  xen/arch/x86/domctl.c   |  4 +++
>>  xen/arch/x86/flushtlb.c | 49 --
>>  xen/arch/x86/mm.c   | 34 +---
>>  xen/arch/x86/pv/dom0_build.c|  1 +
>>  xen/arch/x86/pv/domain.c| 52 
>> +
>>  xen/include/asm-x86/domain.h| 14 +++---
>>  xen/include/asm-x86/pv/domain.h |  2 ++
>>  xen/include/asm-x86/x86-defns.h |  1 +
>>  11 files changed, 158 insertions(+), 16 deletions(-)
> 
> Having had the discussion previously, I'm missing a change to
> smp.c:new_tlbflush_clock_period() here.

I can add that. I did not do this as I haven't treated FLUSH_TLB
differently to FLUSH_TLB_GLOBAL (trying this even without any other
change to staging led to degfaults in dom0). So such a change to
new_tlbflush_clock_period() should be a separate patch I believe.

> 
>> --- a/xen/arch/x86/debug.c
>> +++ b/xen/arch/x86/debug.c
>> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t 
>> pgd3val)
>>  l3_pgentry_t l3e, *l3t;
>>  l2_pgentry_t l2e, *l2t;
>>  l1_pgentry_t l1e, *l1t;
>> -unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
>> +unsigned long cr3 = (pgd3val ? pgd3val
>> + : (dp->vcpu[0]->arch.cr3 & 
>> ~X86_CR3_NOFLUSH));
> 
> What about the PCID portion? You want the address of the page
> here, so I think you should use a "white-listing" masking operation
> instead of a blacklisting one.

The PCID portion is no problem here as the value is converted into a
mfn.

I can do the modification you are asking for, of course.

> 
> Also, as you touch this anyway, it would have been nice to drop
> the unnecessary middle argument at the same time.

Okay.

> 
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -102,7 +102,19 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>  t = pre_flush();
>>  
>>  if ( read_cr4() & X86_CR4_PGE )
>> +/*
>> + * X86_CR4_PGE set means PCID being inactive.
>> + * We have to purge the TLB via flipping cr4.pge.
>> + */
>>  write_cr4(cr4 & ~X86_CR4_PGE);
>> +else if ( cpu_has_invpcid )
>> +/*
>> + * If we are using PCID purge the TLB via INVPCID as loading cr3
>> + * will affect the current PCID only.
>> + * If INVPCID is not supported we don't use PCIDs so loading cr3
>> + * will purge the TLB (we are in the "global pages off" branch).
>> + */
>> +invpcid_flush_all();
> 
> Since with CR4.PGE correctness-wise clear it doesn't matter whether
> you use invpcid_flush_all() or invpcid_flush_all_nonglobals() here,
> does it also not matter performance-wise?

I didn't check that. I'll have a try.

> 
>> @@ -131,14 +143,35 @@ unsigned int flush_area_local(const void *va, unsigned 
>> int flags)
>>  {
>>  if ( order == 0 )
>>  {
>> -/*
>> - * We don't INVLPG multi-page regions because the 2M/4M/1G
>> - * region may not have been mapped with a superpage. Also there
>> - * are various errata surrounding INVLPG usage on superpages, 
>> and
>> - * a full flush is in any case not *that* expensive.
>> - */
> 
> This comment really explains the order == 0 check above, so I
> don't think it should be moved.

Okay.

> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -497,12 +497,38 @@ void free_shared_domheap_page(struct page_info *page)
>>  free_domheap_page(page);
>>  }
>>  
>> +/*
>> + * 

Re: [Xen-devel] [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context

2018-03-23 Thread Juergen Gross
On 22/03/18 15:50, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgr...@suse.com> wrote:
>> When switching to a 64-bit pv context the TLB is flushed twice today:
>> the first time when switching to the new address space in
>> write_ptbase(), the second time when switching to guest mode in
>> restore_to_guest.
>>
>> Avoid the first TLB flush in that case.
>>
>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>> ---
>> V3:
>> - omit setting root_pgt_changed to false (Jan Beulich)
>> ---
>>  xen/arch/x86/mm.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>> index 352600ad73..8c944b33c9 100644
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -123,6 +123,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -507,8 +508,14 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>  if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +{
>>  get_cpu_info()->root_pgt_changed = true;
>> -write_cr3(v->arch.cr3);
>> +asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>> +}
>> +else
>> +{
>> +write_cr3(v->arch.cr3);
>> +}
> 
> Unnecessary braces. with that
> Reviewed-by: Jan Beulich <jbeul...@suse.com>
> (This could be taken care of while committing, but the patch
> depends on patch 1 anyway, which may see further
> transformation.)

Just realized it now: I have to re-introduce the conditionals I removed
on your behalf from patch 1, as otherwise I'd omit TLB flushing via
write_cr3() for 32-bit pv-domains and HVM-domains.

Therefor I'm removing your R-b in case you don't like that (patch 3
changes the conditional again, so I don't think that is really bad).


Juergen


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 7/7] xen/x86: use PCID feature

2018-03-23 Thread Juergen Gross
On 23/03/18 14:46, Jan Beulich wrote:
 On 23.03.18 at 12:29,  wrote:
>> On 23/03/18 11:51, Jan Beulich wrote:
>> On 21.03.18 at 13:51,  wrote:
 Avoid flushing the complete TLB when switching %cr3 for mitigation of
 Meltdown by using the PCID feature if available.

 We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
 2 values for the non-XPTI case:

 - guest active and in kernel mode
 - guest active and in user mode
 - hypervisor active and guest in user mode (XPTI only)
 - hypervisor active and guest in kernel mode (XPTI only)
>>>
>>> Before committing to this route, Jun, Kevin, can we please get
>>> confirmation that PCID isn't (and isn't going to be) subject to the
>>> same speculation issues in the pipeline that the U/S bit is (having
>>> caused Meltdown in the first place)? To me it seems a pretty
>>> likely thing to play all the same games.
>>
>> Really? This would assume either the processor is capable to deal with
>> multiple matching TLB entries when speculating or that there can be
>> only one entry per virtual address present in the TLB at the same time
>> in spite of different PCIDs.
> 
> Hmm, yes, good point.
> 
>> And why aren't you asking for the same problem with VPIDs? This should
>> be comparable to the PCID problem you are suspecting.
> 
> Since Linux is using the approach, I'm not really suspecting a
> problem. I'd just like to be sure there is none / not going to be
> one. None of this is spelled out in the doc after all.
> 
 ---
  docs/misc/xen-command-line.markdown | 12 +
  xen/arch/x86/debug.c|  3 ++-
  xen/arch/x86/domain_page.c  |  2 +-
  xen/arch/x86/domctl.c   |  4 +++
  xen/arch/x86/flushtlb.c | 49 
 --
  xen/arch/x86/mm.c   | 34 +---
  xen/arch/x86/pv/dom0_build.c|  1 +
  xen/arch/x86/pv/domain.c| 52 
 +
  xen/include/asm-x86/domain.h| 14 +++---
  xen/include/asm-x86/pv/domain.h |  2 ++
  xen/include/asm-x86/x86-defns.h |  1 +
  11 files changed, 158 insertions(+), 16 deletions(-)
>>>
>>> Having had the discussion previously, I'm missing a change to
>>> smp.c:new_tlbflush_clock_period() here.
>>
>> I can add that. I did not do this as I haven't treated FLUSH_TLB
>> differently to FLUSH_TLB_GLOBAL (trying this even without any other
>> change to staging led to degfaults in dom0). So such a change to
>> new_tlbflush_clock_period() should be a separate patch I believe.
> 
> Ah yes, you don't have the "flushing too much" issue anymore in
> this version, or at least not in as obvious a way. Or wait - it's in
> patch 4. You still do an including-global flush there regardless of
> whether FLUSH_TLB_GLOBAL was actually requested. Anyway,
> we'll save this aspect for later.
> 
 --- a/xen/arch/x86/debug.c
 +++ b/xen/arch/x86/debug.c
 @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t 
 pgd3val)
  l3_pgentry_t l3e, *l3t;
  l2_pgentry_t l2e, *l2t;
  l1_pgentry_t l1e, *l1t;
 -unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
 +unsigned long cr3 = (pgd3val ? pgd3val
 + : (dp->vcpu[0]->arch.cr3 & 
 ~X86_CR3_NOFLUSH));
>>>
>>> What about the PCID portion? You want the address of the page
>>> here, so I think you should use a "white-listing" masking operation
>>> instead of a blacklisting one.
>>
>> The PCID portion is no problem here as the value is converted into a
>> mfn.
>>
>> I can do the modification you are asking for, of course.
> 
> Well, even if the bits are shifter out in the end, the code could
> look more correct. Plus by masking to just the address, future
> new meaning assigned to currently unused bits would not be
> a problem for this piece of code anymore.
> 
 +/*
 + * Return additional PCID specific cr3 bits.
 + *
 + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
 + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in 
 case
>>>
>>> I stand to my previous comment, which was left unanswered afaics:
>>
>> Uuh, sorry for that.
>>
>>> "Is it a good idea to suppress the flush this way for every reader
>>>  of v->arch.cr3? For example, what about the use in
>>>  dbg_pv_va2mfn()? I think it should be the consumers of the field
>>>  to decide whether to OR in that flag (which isn't part of the
>>>  register value anyway)."
>>> To be more precise, I can see the pcid to be put here (which will
>>> require consumers to be aware anyway), but I don't think the non-
>>> register-value no-flush indicator belongs here. IOW I think after
>>> writing the value into %cr3, the value read back should match the
>>> stored value.
>>
>> This 

Re: [Xen-devel] [PATCH v4 1/4] x86/PVHv2: Add memory map pointer to hvm_start_info struct

2018-03-21 Thread Juergen Gross
On 21/03/18 17:46, Maran Wilson wrote:
> On 3/21/2018 2:40 AM, Juergen Gross wrote:
>> On 21/03/18 10:28, Roger Pau Monné wrote:
>>> On Tue, Mar 20, 2018 at 09:48:56AM -0700, Maran Wilson wrote:
>>>> +/*
>>>>    * C representation of the x86/HVM start info layout.
>>>>    *
>>>>    * The canonical definition of this layout is above, this is just
>>>> a way to
>>>> @@ -86,6 +134,14 @@ struct hvm_start_info {
>>>>   uint64_t cmdline_paddr; /* Physical address of the command
>>>> line. */
>>>>   uint64_t rsdp_paddr;    /* Physical address of the RSDP
>>>> ACPI data    */
>>>>   /*
>>>> structure.    */
>>>> +    uint64_t memmap_paddr;  /* Physical address of an array
>>>> of   */
>>>> +    /* hvm_memmap_table_entry. Only
>>>> present in   */
>>>> +    /* version 1 and newer of the
>>>> structure  */
>>>> +    uint32_t memmap_entries;    /* Number of entries in the memmap
>>>> table.    */
>>>> +    /* Only present in version 1 and
>>>> newer of    */
>>>> +    /* the structure. Value will be
>>>> zero if  */
>>>> +    /* there is no memory map being
>>>> provided.    */
>>>> +    uint32_t reserved;  /* Must be zero for Version
>>>> 1.   */
>>> I would write "Must be zero." only. If at some point we introduce
>>> version 2 we would likely have to fixup this comment to mention
>>> version 1 and version 2.
>> In case you are going this route I'd suggest to drop the version remarks
>> for the individual fields and just add a comment like:
>>
>> /* All following fields only present in version 1 and newer. */
>>
>> above memmap_paddr.
> 
> OK, so combining the above suggestions, I'd have the following. Is the
> formatting and alignment of comments what you had in mind and acceptable
> to all?

Looks good to me.


Juergen

> 
> struct hvm_start_info {
>     uint32_t magic; /* Contains the magic value
> 0x336ec578   */
>     /* ("xEn3" with the 0x80 bit of the "E"
> set).*/
>     uint32_t version;   /* Version of this
> structure.    */
>     uint32_t flags; /* SIF_xxx
> flags.    */
>     uint32_t nr_modules;    /* Number of modules passed to the
> kernel.   */
>     uint64_t modlist_paddr; /* Physical address of an array
> of   */
>     /*
> hvm_modlist_entry.    */
>     uint64_t cmdline_paddr; /* Physical address of the command
> line. */
>     uint64_t rsdp_paddr;    /* Physical address of the RSDP ACPI
> data    */
>     /*
> structure.    */
>     /* All following fields only present in version 1 and newer */
>     uint64_t memmap_paddr;  /* Physical address of an array
> of   */
>     /*
> hvm_memmap_table_entry.   */
>     uint32_t memmap_entries;    /* Number of entries in the memmap
> table.    */
>     /* Value will be zero if there is no
> memory  */
>     /* map being
> provided.   */
>     uint32_t reserved;  /* Must be
> zero. */
> };
> 
> Thanks,
> -Maran
> 
> 


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] Xen 4.11 Development Update

2018-03-18 Thread Juergen Gross
This email only tracks big items for xen.git tree. Please reply for items you
would like to see in 4.11 so that people have an idea what is going on and
prioritise accordingly.

You're welcome to provide description and use cases of the feature you're
working on.

= Timeline =

We now adopt a fixed cut-off date scheme. We will release twice a
year. The upcoming 4.11 timeline are as followed:

* Last posting date: March 16th, 2018
  We are here
* Hard code freeze: March 30th, 2018
* RC1: TBD
* Release: June 1st, 2018

Note that we don't have freeze exception scheme anymore. We are not
accepting new features any more for Xen 4.11 now.
Patche series pending will need to be committed before March 30th, 2018 to
make it into 4.11.
All other patches will be automatically queued into next release.

RCs will be arranged immediately after freeze.

We recently introduced a jira instance to track all the tasks (not only big)
for the project. See: https://xenproject.atlassian.net/projects/XEN/issues.

Most of the tasks tracked by this e-mail also have a corresponding jira task
referred by XEN-N.

I have started to include the version number of series associated to each
feature. Can each owner send an update on the version number if the series
was posted upstream?

= Projects =

== Hypervisor == 

*  Per-cpu tasklet
  -  XEN-28
  -  Konrad Rzeszutek Wilk

=== x86 === 

*  guest resource mapping (v17)
  -  Paul Durrant

*  vNVDIMM support for HVM guest (RFC v4)
  -  XEN-45
  -  Haozhong Zhang

*  hypervisor x86 instruction emulator additions (v4)
  -  Jan Beulich

*  PV-IOMMU
  -  Paul Durrant

*  HVM guest CPU topology support (RFC)
  -  Chao Gao

*  Vixen: A PV-in-HVM shim (v3)
  -  Anthony Liguori

*  Intel Processor Trace virtualization enabling (v1)
  -  Luwei Kang

*  PCI config space emulation in Xen for PVH Dom0 (v8)
  -  Roger Paul Monné

*  XPTI speedup (v2)
  -  Juergen Gross

=== ARM === 

*  SMMUv3 driver (RFC v4)
  -  Sameer Goel

*  IORT support (RFC)
  -  Manish Jaggi

*  Implement branch predictor hardening for affected Cortex-A CPUs (v1)
  -  Julien Grall

*  New VGIC(-v2) implementation (v1)
  -  Andre Przywara

== Grub2 == 

*  Support PVH guest boot (v1)
  -  Juergen Gross

== Completed == 

=== x86 === 

*  Add dmops to allow use of VGA with restricted QEMU
  -  Ross Lagerwall

*  Enable Memory Bandwidth Allocation in Xen
  -  XEN-48
  -  Yi Sun

*  Comet: Run PV in PVH container
  -  Wei Liu

*  Mitigations for SP2/CVE-2017-5715/Branch Target Injection
  -  Andrew Cooper

*  Mitigations for Meltdown/CVE-2017-5754
  -  Jan Beulich


Juergen Gross

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 1/4] x86/PVHv2: Add memory map pointer to hvm_start_info struct

2018-03-16 Thread Juergen Gross
On 16/03/18 12:48, Jan Beulich wrote:
 On 16.03.18 at 12:37,  wrote:
>> On Fri, Mar 16, 2018 at 05:29:27AM -0600, Jan Beulich wrote:
>> On 16.03.18 at 12:11,  wrote:
 On Thu, Mar 15, 2018 at 02:33:09PM -0700, Maran Wilson wrote:
> @@ -48,6 +49,15 @@
>   * 32 ++
>   *| rsdp_paddr | Physical address of the RSDP ACPI data 
> structure.
>   * 40 ++
> + *| memmap_paddr   | Physical address of the (optional) memory map. 
>> Only
> + *|| present in version 1 and newer of the structure.
> + * 48 ++
> + *| memmap_entries | Number of entries in the memory map table. Zero
> + *|| if there is no memory map being provided. Only

 Again (as I've mentioned in previous reviews) the way to signal a
 non-present memory map is to set memmap_paddr to 0, not memmap_entries
 to 0. This is already covered by the comment at the top of the header,
 which states:

 NOTE: nothing will be loaded at physical address 0, so a 0 value in any
 of the address fields should be treated as not present.
>>>
>>> I still think it is legitimate to direct consumers to look at the entry
>>> count here.
>>
>> We have another similar field tuple already, modlist_paddr and
>> nr_modules and in that case (according to the current comments) the
>> proper way to check if there are modules is to check modlist_paddr !=
>> 0 and then get the count from nr_modules.
> 
> Well, that's the way it is now, as it's out in the wild already.
> 
>> Using this access strategy we avoid adding more comments about
>> checking nr_modules != 0 before trying to access modlist_paddr.
> 
> I don't follow this argumentation: There is an obvious implication
> that only nr_modules entries are valid to access at modlist_paddr.
> If nr_modules is zero, no entry is valid to access. Nothing to be
> said explicitly in the comment.

The 0 address is important for cases without a count field (e.g. the
rsdp_paddr field).

In case where a count is supplied I'd say a count value of 0 should be
used to indicate no entry is present. Setting the address to zero in
this case, too, should be allowed, of course.

I'd regard an address of zero and count > 0 as invalid.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 1/2] x86: report if PCID and INVPCID are supported

2018-03-05 Thread Juergen Gross
On 05/03/18 10:50, Wei Liu wrote:
> Signed-off-by: Wei Liu <wei.l...@citrix.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 2/2] x86: use invpcid to do global flushing

2018-03-05 Thread Juergen Gross
On 05/03/18 10:50, Wei Liu wrote:
> Signed-off-by: Wei Liu <wei.l...@citrix.com>

Reviewed-by: Juergen Gross <jgr...@suse.com>


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 1/2] x86: report if PCID and INVPCID are supported

2018-03-05 Thread Juergen Gross
On 05/03/18 12:20, Jan Beulich wrote:
 On 05.03.18 at 10:50,  wrote:
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1701,6 +1701,13 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>> cpu_has_nx ? XENLOG_INFO : XENLOG_WARNING "Warning: ",
>> cpu_has_nx ? "" : "not ");
>>  
>> +
>> +printk(XENLOG_INFO
>> +   "PCID (Process-Context IDentifier) %ssupported\n",
>> +   cpu_has_pcid ? "" : "not ");
>> +
>> +printk(XENLOG_INFO "INVPCID %ssupported\n", cpu_has_invpcid ? "" : "not 
>> ");
> 
> Do we really need this? We log a message for NX as an exception,
> we don't do so for other features (and things would get pretty
> unwieldy if we did).

I'd rather keep this message. As we are hiding PCID and INPCID from dom0
this is the only indicator of those features being supported. In case of
an error related to TLB consistency this information is important IMO.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

<    5   6   7   8   9   10   11   12   13   14   >