Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-15 Thread Jan Beulich
>>> On 14.02.18 at 17:23,  wrote:
> The current XPTI implementation isolates the directmap (and therefore a lot of
> guest data), but a large quantity of CPU0's state (including its stack)
> remains visible.
> 
> Furthermore, an attacker able to read .text is in a vastly superior position
> to normal when it comes to fingerprinting Xen for known vulnerabilities, or
> scanning for ROP/Spectre gadgets.
> 
> Collect together the entrypoints in .text.entry (currently 3x4k frames, but
> can almost certainly be slimmed down), and create a common mapping which is
> inserted into each per-cpu shadow.  The stubs are also inserted into this
> mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
> boot, or CPU hotplug) to work without further changes to the common mappings.
> 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Wei Liu 
> CC: Juergen Gross 
> 
> v2:
>  * Drop "incomplete" warning from the docs.  This is about as complete as it
>    can reasonably get.
>  * Correct BUILD_BUG_ON() calculation, duplicate in the common_pgt code
>  * scope/const improvements
>  * Use .push/.popsection in preference to .previous
>  * Exclude {compat_}create_bounce_frame from .text.entry
>  * Extend the sanity checking of linear in clone_mapping().  There is a latent
>    bug where a bad linear parameter would cause Xen to fall over a NULL pointer.
> ---
>  docs/misc/xen-command-line.markdown |  3 --
>  xen/arch/x86/smpboot.c  | 56 +
>  xen/arch/x86/x86_64/compat/entry.S  |  5 
>  xen/arch/x86/x86_64/entry.S | 11 ++--
>  xen/arch/x86/xen.lds.S  |  7 +
>  5 files changed, 71 insertions(+), 11 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 79feba6..8317639 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1935,9 +1935,6 @@ mode.
>  Override default selection of whether to isolate 64-bit PV guest page
>  tables.
>  
> -** WARNING: Not yet a complete isolation implementation, but better than
> -nothing. **
> -
>  ### xsave
>  > `= <boolean>`
>  
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index 2ebef03..10bf2f3 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>  }
>  
> +/* Confirm that all stubs fit in a single L2 pagetable. */
> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L3_PAGETABLE_SHIFT));

It's a little confusing that the comment talks about L2 yet the check
is on L3. May I suggest you either use
L2_PAGETABLE_ENTRIES << L2_PAGETABLE_SHIFT or say "are
covered by a single L3 entry" in the comment?
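
For concreteness, the two quantities are the same number: a full L2
pagetable covers exactly the range a single L3 entry maps, which is why
either fix works.  A minimal standalone check, with the standard 4-level
paging constants restated here purely as illustrative stand-ins:

    #include <assert.h>

    #define PAGE_SIZE             4096ul
    #define L2_PAGETABLE_SHIFT    21    /* one L2 entry maps 2 MiB */
    #define L3_PAGETABLE_SHIFT    30    /* one L3 entry maps 1 GiB */
    #define L2_PAGETABLE_ENTRIES  512ul

    int main(void)
    {
        /* A full L2 table covers exactly what one L3 entry maps, */
        assert(L2_PAGETABLE_ENTRIES << L2_PAGETABLE_SHIFT ==
               1ul << L3_PAGETABLE_SHIFT);
        /* so one 4k stub VA per CPU caps NR_CPUS at 1 GiB / 4 KiB. */
        assert((1ul << L3_PAGETABLE_SHIFT) / PAGE_SIZE == 262144);
        return 0;
    }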

> @@ -646,13 +649,23 @@ static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
>  {
>  unsigned long linear = (unsigned long)ptr, pfn;
>  unsigned int flags;
> -l3_pgentry_t *pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) +
> - l3_table_offset(linear);
> +l3_pgentry_t *pl3e;
>  l2_pgentry_t *pl2e;
>  l1_pgentry_t *pl1e;
>  
> -if ( linear < DIRECTMAP_VIRT_START )
> -return 0;
> +/*
> + * Sanity check 'linear'.  We only allow cloning from the Xen virtual
> + * range, and in particular, only from the directmap and .text ranges.
> + */
> +if ( root_table_offset(linear) > ROOT_PAGETABLE_LAST_XEN_SLOT ||
> + root_table_offset(linear) < ROOT_PAGETABLE_FIRST_XEN_SLOT )
> +return -EINVAL;
> +if ( !(linear >= DIRECTMAP_VIRT_START ||
> +   (linear >= XEN_VIRT_START && linear < XEN_VIRT_END)) )

Putting it this way is certainly fine, but generally I find using !() on
logical || or && expressions less easy to understand than pushing the
negation through the entire expression. Once having done so, it
might even be debatable whether

if ( linear < XEN_VIRT_START ||
     (linear >= XEN_VIRT_END && linear < DIRECTMAP_VIRT_START) )
return -EINVAL;

wouldn't be better.
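
As a quick check that the de-negated form rejects exactly the same set of
addresses: the only property needed is the ordering XEN_VIRT_START <
XEN_VIRT_END <= DIRECTMAP_VIRT_START, since anything below XEN_VIRT_START
is then necessarily also below DIRECTMAP_VIRT_START.  A minimal sketch,
using toy stand-in constants with only that ordering:

    #include <assert.h>
    #include <stdbool.h>

    /* Toy stand-ins; only their ordering matters. */
    enum { XEN_VIRT_START = 100, XEN_VIRT_END = 200,
           DIRECTMAP_VIRT_START = 300 };

    static bool reject_v2(unsigned long l)    /* form in the v2 patch */
    {
        return !(l >= DIRECTMAP_VIRT_START ||
                 (l >= XEN_VIRT_START && l < XEN_VIRT_END));
    }

    static bool reject_flat(unsigned long l)  /* negation pushed through */
    {
        return l < XEN_VIRT_START ||
               (l >= XEN_VIRT_END && l < DIRECTMAP_VIRT_START);
    }

    int main(void)
    {
        /* The two forms agree everywhere, boundaries included. */
        for ( unsigned long l = 0; l < 400; l++ )
            assert(reject_v2(l) == reject_flat(l));
        return 0;
    }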

With at least the first (comment vs code) aspect taken care of in
both instances
Reviewed-by: Jan Beulich 

Jan


[Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Andrew Cooper
The current XPTI implementation isolates the directmap (and therefore a lot of
guest data), but a large quantity of CPU0's state (including its stack)
remains visible.

Furthermore, an attacker able to read .text is in a vastly superior position
to normal when it comes to fingerprinting Xen for known vulnerabilities, or
scanning for ROP/Spectre gadgets.

Collect together the entrypoints in .text.entry (currently 3x4k frames, but
can almost certainly be slimmed down), and create a common mapping which is
inserted into each per-cpu shadow.  The stubs are also inserted into this
mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
boot, or CPU hotplug) to work without further changes to the common mappings.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Wei Liu 
CC: Juergen Gross 

v2:
 * Drop "incomplete" warning from the docs.  This is about as complete as it
   can reasonably get.
 * Correct BUILD_BUG_ON() calculation, duplicate in the common_pgt code
 * scope/const improvements
 * Use .push/.popsection in preference to .previous
 * Exclude {compat_}create_bounce_frame from .text.entry
 * Extend the sanity checking of linear in clone_mapping().  There is a latent
   bug where a bad linear parameter would cause Xen to fall over a NULL pointer.
---
 docs/misc/xen-command-line.markdown |  3 --
 xen/arch/x86/smpboot.c  | 56 +
 xen/arch/x86/x86_64/compat/entry.S  |  5 
 xen/arch/x86/x86_64/entry.S | 11 ++--
 xen/arch/x86/xen.lds.S  |  7 +
 5 files changed, 71 insertions(+), 11 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 79feba6..8317639 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1935,9 +1935,6 @@ mode.
 Override default selection of whether to isolate 64-bit PV guest page
 tables.
 
-** WARNING: Not yet a complete isolation implementation, but better than
-nothing. **
-
 ### xsave
 > `= <boolean>`
 
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 2ebef03..10bf2f3 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
 unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
 }
 
+/* Confirm that all stubs fit in a single L2 pagetable. */
+BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L3_PAGETABLE_SHIFT));
+
 stub_va = XEN_VIRT_END - (cpu + 1) * PAGE_SIZE;
 if ( map_pages_to_xen(stub_va, mfn_x(page_to_mfn(pg)), 1,
   PAGE_HYPERVISOR_RX | MAP_SMALL_PAGES) )
@@ -646,13 +649,23 @@ static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
 {
 unsigned long linear = (unsigned long)ptr, pfn;
 unsigned int flags;
-l3_pgentry_t *pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) +
- l3_table_offset(linear);
+l3_pgentry_t *pl3e;
 l2_pgentry_t *pl2e;
 l1_pgentry_t *pl1e;
 
-if ( linear < DIRECTMAP_VIRT_START )
-return 0;
+/*
+ * Sanity check 'linear'.  We only allow cloning from the Xen virtual
+ * range, and in particular, only from the directmap and .text ranges.
+ */
+if ( root_table_offset(linear) > ROOT_PAGETABLE_LAST_XEN_SLOT ||
+ root_table_offset(linear) < ROOT_PAGETABLE_FIRST_XEN_SLOT )
+return -EINVAL;
+if ( !(linear >= DIRECTMAP_VIRT_START ||
+   (linear >= XEN_VIRT_START && linear < XEN_VIRT_END)) )
+return -EINVAL;
+
+pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) +
+l3_table_offset(linear);
 
 flags = l3e_get_flags(*pl3e);
 ASSERT(flags & _PAGE_PRESENT);
@@ -744,8 +757,12 @@ static __read_mostly int8_t opt_xpti = -1;
 boolean_param("xpti", opt_xpti);
 DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
 
+extern const char _stextentry[], _etextentry[];
+
 static int setup_cpu_root_pgt(unsigned int cpu)
 {
+static root_pgentry_t common_pgt;
+
 root_pgentry_t *rpt;
 unsigned int off;
 int rc;
@@ -764,8 +781,35 @@ static int setup_cpu_root_pgt(unsigned int cpu)
 idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
 /* SH_LINEAR_PT inserted together with guest mappings. */
 /* PERDOMAIN inserted during context switch. */
-rpt[root_table_offset(XEN_VIRT_START)] =
-idle_pg_table[root_table_offset(XEN_VIRT_START)];
+
+/* One-time setup of common_pgt, which maps .text.entry and the stubs. */
+if ( unlikely(!root_get_intpte(common_pgt)) )
+{
+unsigned long stubs_linear = XEN_VIRT_END - 1;
+l3_pgentry_t *stubs_main, *stubs_shadow;
+const char *ptr;
+
+for ( rc = 0, ptr = _stextentry;
+  !rc && ptr < _etextentry; ptr += PAGE_SIZE )
+rc = clone_mapping(ptr, rpt);
+
+if ( rc )

Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Andrew Cooper
On 14/02/18 12:27, Juergen Gross wrote:
> On 14/02/18 13:19, Andrew Cooper wrote:
>> On 14/02/18 12:15, Juergen Gross wrote:
>>> On 14/02/18 13:03, Juergen Gross wrote:
 On 14/02/18 12:48, Andrew Cooper wrote:
> On 14/02/18 07:54, Juergen Gross wrote:
>> On 13/02/18 20:45, Andrew Cooper wrote:
>>> The current XPTI implementation isolates the directmap (and therefore a lot of
>>> guest data), but a large quantity of CPU0's state (including its stack)
>>> remains visible.
>>>
>>> Furthermore, an attacker able to read .text is in a vastly superior position
>>> to normal when it comes to fingerprinting Xen for known vulnerabilities, or
>>> scanning for ROP/Spectre gadgets.
>>>
>>> Collect together the entrypoints in .text.entry (currently 3x4k frames, but
>>> can almost certainly be slimmed down), and create a common mapping which is
>>> inserted into each per-cpu shadow.  The stubs are also inserted into this
>>> mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
>>> boot, or CPU hotplug) to work without further changes to the common mappings.
>>>
>>> Signed-off-by: Andrew Cooper 
>>> ---
>>> CC: Jan Beulich 
>>> CC: Wei Liu 
>>> CC: Juergen Gross 
>>>
>>> RFC, because I don't think the stubs handling is particularly sensible.
>>>
>>> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
>>> together onto a single MFN.  The stubs ought to be isolated as well (as they
>>> leak the virtual addresses of each stack), which can be done by allocating an
>>> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
>>> can't use a common set of mappings, and will have to clone the single stub and
>>> .entry.text into each PCPUs copy of the pagetables.
>>>
>>> Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
>>> avoid any further pagetable allocations) has come a little unstuck because of
>>> CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
>>> rearrange the virtual layout for this plan to work.
>>> ---
>>>  xen/arch/x86/smpboot.c | 37 -
>>>  xen/arch/x86/x86_64/compat/entry.S |  2 ++
>>>  xen/arch/x86/x86_64/entry.S|  4 +++-
>>>  xen/arch/x86/xen.lds.S |  7 +++
>>>  4 files changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>>> index 2ebef03..2519141 100644
>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>>>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>>>  }
>>>  
>>> +/* Confirm that all stubs fit in a single L2 pagetable. */
>>> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
>> So we limit NR_CPUS to be max 512 now?
> Not intentionally.  The PAGE_SIZE should be dropped.  (One full L2
> pagetable allows us to map 512*512 pages).
 L2_PAGETABLE_SHIFT is 21. So I still don't get why dropping PAGE_SIZE
 will correct it. OTOH I'm quite sure the BUILD_BUG_ON() won't trigger
 any more with PAGE_SIZE being dropped. :-)

>> Maybe you should use STUB_BUF_SIZE instead of PAGE_SIZE?
> No - that would be incorrect because of the physical packing of stubs
> which occurs.
>
>> BTW: Is there any reason we don't use a common stub page mapped to each
>> per-cpu stack area? The stack address can easily be obtained via %rip
>> relative addressing then (see my patch for the per-vcpu stacks:
>> https://lists.xen.org/archives/html/xen-devel/2018-02/msg00773.html )
> I don't understand what you are asking here.  We cannot access the
> per-cpu area with plain rip-relative addressing without using gs base
> (and we really don't want to go down that route), or without per-cpu
> pagetables (which would have to be a compile time choice).
 The stub-page of a cpu is currently mapped as the 3rd page of the
 stack area. So the distance to the primary stack would be the same
 for all cpus (a little bit less than 20kB).

> As for why the per-cpu areas aren't mapped, that's because they aren't
> needed at the moment.  Any decision to change this needs to weigh the
> utility of mapping the areas vs the additional data leakage, which is
> substantial.
 The stack area is mapped. And that's where the stub is living.
>>> Oh, did I mix up something? I followed the comments in current.h. The
>>> code suggests the syscall trampoline page isn't used at the moment for
>>> the stubs...

Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Andrew Cooper
On 14/02/18 09:14, Jan Beulich wrote:
 On 13.02.18 at 20:45,  wrote:
>> RFC, because I don't think the stubs handling is particularly sensible.
>>
>> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
>> together onto a single MFN.  The stubs ought to be isolated as well (as they
>> leak the virtual addresses of each stack), which can be done by allocating an
>> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
>> can't use a common set of mappings, and will have to clone the single stub and
>> .entry.text into each PCPUs copy of the pagetables.
> The 4k-per-CPU allocation of VA space is probably not strictly
> necessary - quoting the original commit message:
> "While sharing physical pages among certain CPUs on the same node, for
>  now the virtual mappings get established in distinct pages for each
>  CPU. This isn't a strict requirement, but simplifies VA space
>  management for this initial implementation: Sharing VA space would
>  require additional tracking of which areas are currently in use. If
>  the VA and/or TLB overhead turned out to be a problem, such extra code
>  could easily be added."
>
> Without allocating a page per CPU I think what you do is sensible.
> And I don't see how hiding stubs of other CPUs would help much -
> an attacker can gather stack locations from various CPUs as its
> vCPU is being moved around, and you can't hide the current CPU's
> stub space.

Well - scenarios with pinning or cpupools would have a restricted access
to other cpu stack locations.

If the stubs were split in half, we could at least separate the syscall
stubs from the emulation stubs, and leave the latter unmapped.

>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>>  }
>>  
>> +/* Confirm that all stubs fit in a single L2 pagetable. */
>> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
> Perhaps duplicate this in setup_cpu_root_pgt() (suitably adjusted
> of course, as Jürgen has already pointed out)?
>
>> @@ -651,9 +654,6 @@ static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
>>  l2_pgentry_t *pl2e;
>>  l1_pgentry_t *pl1e;
>>  
>> -if ( linear < DIRECTMAP_VIRT_START )
>> -return 0;
> Isn't outright removing this going a little too far?

Considering it is buggy, no.  Returning success after doing nothing is
very antisocial.

If anything, it should return -EINVAL so the caller knows that something
unexpected happened, but it should also reject attempts to clone mappings
beyond the end of the directmap (as those will explode when we switch to
PV context), and the XEN_VIRT_* range needs white-listing as ok to clone
for this usecase to work.
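
A sketch of the checking scheme described above, which is essentially the
shape the v2 hunk quoted at the top of this digest takes (the layout
constants here are illustrative stand-ins, and v2 additionally bounds the
upper end of the directmap via the L4 Xen-slot check):

    #include <errno.h>

    /* Illustrative stand-ins; only the ordering
     * XEN_VIRT_START < XEN_VIRT_END <= DIRECTMAP_VIRT_START matters. */
    #define XEN_VIRT_START        0xffff82d080000000ul
    #define XEN_VIRT_END          0xffff82d0c0000000ul
    #define DIRECTMAP_VIRT_START  0xffff830000000000ul

    /* Accept only the directmap and the Xen image range; reject,
     * rather than silently succeed on, everything else. */
    static int check_linear(unsigned long linear)
    {
        if ( linear < XEN_VIRT_START ||
             (linear >= XEN_VIRT_END && linear < DIRECTMAP_VIRT_START) )
            return -EINVAL;
        return 0;
    }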

>> --- a/xen/arch/x86/x86_64/compat/entry.S
>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>> @@ -13,6 +13,8 @@
>>  #include 
>>  #include 
>>  
>> +.section .text.entry, "ax", @progbits
>> +
>>  ENTRY(entry_int82)
>>  ASM_CLAC
>>  pushq $0
> This also puts compat_create_bounce_frame into the entry section,
> which surely isn't needed. Same for the non-compat variant.

I considered that.  Excluding those ranges won't reduce the number of
allocations we make, but it does complicate backports.

Also, I still fully intend to make the functions disappear into C and
move out of .text.entry that way.

>
>> @@ -854,7 +856,7 @@ GLOBAL(autogen_entrypoints)
>>  .popsection
>>  .endm
>>  
>> -.text
>> +.previous
>>  autogen_stubs: /* Automatically generated stubs. */
> Perhaps better to switch the earlier .section to .pushsection, and
> use .popsection here?

Will do.

~Andrew


Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Juergen Gross
On 14/02/18 13:19, Andrew Cooper wrote:
> On 14/02/18 12:15, Juergen Gross wrote:
>> On 14/02/18 13:03, Juergen Gross wrote:
>>> On 14/02/18 12:48, Andrew Cooper wrote:
 On 14/02/18 07:54, Juergen Gross wrote:
> On 13/02/18 20:45, Andrew Cooper wrote:
>> The current XPTI implementation isolates the directmap (and therefore a lot of
>> guest data), but a large quantity of CPU0's state (including its stack)
>> remains visible.
>>
>> Furthermore, an attacker able to read .text is in a vastly superior position
>> to normal when it comes to fingerprinting Xen for known vulnerabilities, or
>> scanning for ROP/Spectre gadgets.
>>
>> Collect together the entrypoints in .text.entry (currently 3x4k frames, but
>> can almost certainly be slimmed down), and create a common mapping which is
>> inserted into each per-cpu shadow.  The stubs are also inserted into this
>> mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
>> boot, or CPU hotplug) to work without further changes to the common mappings.
>>
>> Signed-off-by: Andrew Cooper 
>> ---
>> CC: Jan Beulich 
>> CC: Wei Liu 
>> CC: Juergen Gross 
>>
>> RFC, because I don't think the stubs handling is particularly sensible.
>>
>> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
>> together onto a single MFN.  The stubs ought to be isolated as well (as they
>> leak the virtual addresses of each stack), which can be done by allocating an
>> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
>> can't use a common set of mappings, and will have to clone the single stub and
>> .entry.text into each PCPUs copy of the pagetables.
>>
>> Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
>> avoid any further pagetable allocations) has come a little unstuck because of
>> CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
>> rearrange the virtual layout for this plan to work.
>> ---
>>  xen/arch/x86/smpboot.c | 37 -
>>  xen/arch/x86/x86_64/compat/entry.S |  2 ++
>>  xen/arch/x86/x86_64/entry.S|  4 +++-
>>  xen/arch/x86/xen.lds.S |  7 +++
>>  4 files changed, 44 insertions(+), 6 deletions(-)
>>
>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>> index 2ebef03..2519141 100644
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>>  }
>>  
>> +/* Confirm that all stubs fit in a single L2 pagetable. */
>> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
> So we limit NR_CPUS to be max 512 now?
 Not intentionally.  The PAGE_SIZE should be dropped.  (One full L2
 pagetable allows us to map 512*512 pages).
>>> L2_PAGETABLE_SHIFT is 21. So I still don't get why dropping PAGE_SIZE
>>> will correct it. OTOH I'm quite sure the BUILD_BUG_ON() won't trigger
>>> any more with PAGE_SIZE being dropped. :-)
>>>
> Maybe you should use STUB_BUF_SIZE instead of PAGE_SIZE?
 No - that would be incorrect because of the physical packing of stubs
 which occurs.

> BTW: Is there any reason we don't use a common stub page mapped to each
> per-cpu stack area? The stack address can easily be obtained via %rip
> relative addressing then (see my patch for the per-vcpu stacks:
> https://lists.xen.org/archives/html/xen-devel/2018-02/msg00773.html )
 I don't understand what you are asking here.  We cannot access the
 per-cpu area with plain rip-relative addressing without using gs base
 (and we really don't want to go down that route), or without per-cpu
 pagetables (which would have to be a compile time choice).
>>> The stub-page of a cpu is currently mapped as the 3rd page of the
>>> stack area. So the distance to the primary stack would be the same
>>> for all cpus (a little bit less than 20kB).
>>>
 As for why the per-cpu areas aren't mapped, that's because they aren't
 needed at the moment.  Any decision to change this needs to weigh the
 utility of mapping the areas vs the additional data leakage, which is
 substantial.
>>> The stack area is mapped. And that's where the stub is living.
>> Oh, did I mix up something? I followed the comments in current.h. The
>> code suggests the syscall trampoline page isn't used at the moment for
>> the stubs...
> 
> That will be stale from the work Jan did to make the stack fully NX.

Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Andrew Cooper
On 14/02/18 12:15, Juergen Gross wrote:
> On 14/02/18 13:03, Juergen Gross wrote:
>> On 14/02/18 12:48, Andrew Cooper wrote:
>>> On 14/02/18 07:54, Juergen Gross wrote:
 On 13/02/18 20:45, Andrew Cooper wrote:
> The current XPTI implementation isolates the directmap (and therefore a lot of
> guest data), but a large quantity of CPU0's state (including its stack)
> remains visible.
>
> Furthermore, an attacker able to read .text is in a vastly superior position
> to normal when it comes to fingerprinting Xen for known vulnerabilities, or
> scanning for ROP/Spectre gadgets.
>
> Collect together the entrypoints in .text.entry (currently 3x4k frames, but
> can almost certainly be slimmed down), and create a common mapping which is
> inserted into each per-cpu shadow.  The stubs are also inserted into this
> mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
> boot, or CPU hotplug) to work without further changes to the common mappings.
>
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Wei Liu 
> CC: Juergen Gross 
>
> RFC, because I don't think the stubs handling is particularly sensible.
>
> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
> together onto a single MFN.  The stubs ought to be isolated as well (as they
> leak the virtual addresses of each stack), which can be done by allocating an
> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
> can't use a common set of mappings, and will have to clone the single stub and
> .entry.text into each PCPUs copy of the pagetables.
>
> Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
> avoid any further pagetable allocations) has come a little unstuck because of
> CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
> rearrange the virtual layout for this plan to work.
> ---
>  xen/arch/x86/smpboot.c | 37 -
>  xen/arch/x86/x86_64/compat/entry.S |  2 ++
>  xen/arch/x86/x86_64/entry.S|  4 +++-
>  xen/arch/x86/xen.lds.S |  7 +++
>  4 files changed, 44 insertions(+), 6 deletions(-)
>
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index 2ebef03..2519141 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>  }
>  
> +/* Confirm that all stubs fit in a single L2 pagetable. */
> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
 So we limit NR_CPUS to be max 512 now?
>>> Not intentionally.  The PAGE_SIZE should be dropped.  (One full L2
>>> pagetable allows us to map 512*512 pages).
>> L2_PAGETABLE_SHIFT is 21. So I still don't get why dropping PAGE_SIZE
>> will correct it. OTOH I'm quite sure the BUILD_BUG_ON() won't trigger
>> any more with PAGE_SIZE being dropped. :-)
>>
 Maybe you should use STUB_BUF_SIZE instead of PAGE_SIZE?
>>> No - that would be incorrect because of the physical packing of stubs
>>> which occurs.
>>>
 BTW: Is there any reason we don't use a common stub page mapped to each
 per-cpu stack area? The stack address can easily be obtained via %rip
 relative addressing then (see my patch for the per-vcpu stacks:
 https://lists.xen.org/archives/html/xen-devel/2018-02/msg00773.html )
>>> I don't understand what you are asking here.  We cannot access the
>>> per-cpu area with plain rip-relative addressing without using gs base
>>> (and we really don't want to go down that route), or without per-cpu
>>> pagetables (which would have to be a compile time choice).
>> The stub-page of a cpu is currently mapped as the 3rd page of the
>> stack area. So the distance to the primary stack would be the same
>> for all cpus (a little bit less than 20kB).
>>
>>> As for why the per-cpu areas aren't mapped, that's because they aren't
>>> needed at the moment.  Any decision to change this needs to weigh the
>>> utility of mapping the areas vs the additional data leakage, which is
>>> substantial.
>> The stack area is mapped. And that's where the stub is living.
> Oh, did I mix up something? I followed the comments in current.h. The
> code suggests the syscall trampoline page isn't used at the moment for
> the stubs...

That will be stale from the work Jan did to make the stack fully NX. 
The syscall stubs used to be on the stack, but are no longer.

~Andrew


Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Juergen Gross
On 14/02/18 13:03, Juergen Gross wrote:
> On 14/02/18 12:48, Andrew Cooper wrote:
>> On 14/02/18 07:54, Juergen Gross wrote:
>>> On 13/02/18 20:45, Andrew Cooper wrote:
 The current XPTI implementation isolates the directmap (and therefore a lot of
 guest data), but a large quantity of CPU0's state (including its stack)
 remains visible.

 Furthermore, an attacker able to read .text is in a vastly superior position
 to normal when it comes to fingerprinting Xen for known vulnerabilities, or
 scanning for ROP/Spectre gadgets.

 Collect together the entrypoints in .text.entry (currently 3x4k frames, but
 can almost certainly be slimmed down), and create a common mapping which is
 inserted into each per-cpu shadow.  The stubs are also inserted into this
 mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
 boot, or CPU hotplug) to work without further changes to the common mappings.

 Signed-off-by: Andrew Cooper 
 ---
 CC: Jan Beulich 
 CC: Wei Liu 
 CC: Juergen Gross 

 RFC, because I don't think the stubs handling is particularly sensible.

 We allocate 4k of virtual address space per CPU, but squash loads of CPUs
 together onto a single MFN.  The stubs ought to be isolated as well (as they
 leak the virtual addresses of each stack), which can be done by allocating an
 MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
 can't use a common set of mappings, and will have to clone the single stub and
 .entry.text into each PCPUs copy of the pagetables.

 Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
 avoid any further pagetable allocations) has come a little unstuck because of
 CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
 rearrange the virtual layout for this plan to work.
 ---
  xen/arch/x86/smpboot.c | 37 -
  xen/arch/x86/x86_64/compat/entry.S |  2 ++
  xen/arch/x86/x86_64/entry.S|  4 +++-
  xen/arch/x86/xen.lds.S |  7 +++
  4 files changed, 44 insertions(+), 6 deletions(-)

 diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
 index 2ebef03..2519141 100644
 --- a/xen/arch/x86/smpboot.c
 +++ b/xen/arch/x86/smpboot.c
 @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
  }
  
 +/* Confirm that all stubs fit in a single L2 pagetable. */
 +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
>>> So we limit NR_CPUS to be max 512 now?
>>
>> Not intentionally.  The PAGE_SIZE should be dropped.  (One full L2
>> pagetable allows us to map 512*512 pages).
> 
> L2_PAGETABLE_SHIFT is 21. So I still don't get why dropping PAGE_SIZE
> will correct it. OTOH I'm quite sure the BUILD_BUG_ON() won't trigger
> any more with PAGE_SIZE being dropped. :-)
> 
>>> Maybe you should use STUB_BUF_SIZE instead of PAGE_SIZE?
>>
>> No - that would be incorrect because of the physical packing of stubs
>> which occurs.
>>
>>> BTW: Is there any reason we don't use a common stub page mapped to each
>>> per-cpu stack area? The stack address can easily be obtained via %rip
>>> relative addressing then (see my patch for the per-vcpu stacks:
>>> https://lists.xen.org/archives/html/xen-devel/2018-02/msg00773.html )
>>
>> I don't understand what you are asking here.  We cannot access the
>> per-cpu area with plain rip-relative addressing without using gs base
>> (and we really don't want to go down that route), or without per-cpu
>> pagetables (which would have to be a compile time choice).
> 
> The stub-page of a cpu is currently mapped as the 3rd page of the
> stack area. So the distance to the primary stack would be the same
> for all cpus (a little bit less than 20kB).
> 
>> As for why the per-cpu areas aren't mapped, that's because they aren't
>> needed at the moment.  Any decision to change this needs to weigh the
>> utility of mapping the areas vs the additional data leakage, which is
>> substantial.
> 
> The stack area is mapped. And that's where the stub is living.

Oh, did I mix up something? I followed the comments in current.h. The
code suggests the syscall trampoline page isn't used at the moment for
the stubs...


Juergen


Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Juergen Gross
On 14/02/18 12:48, Andrew Cooper wrote:
> On 14/02/18 07:54, Juergen Gross wrote:
>> On 13/02/18 20:45, Andrew Cooper wrote:
>>> The current XPTI implementation isolates the directmap (and therefore a lot of
>>> guest data), but a large quantity of CPU0's state (including its stack)
>>> remains visible.
>>>
>>> Furthermore, an attacker able to read .text is in a vastly superior position
>>> to normal when it comes to fingerprinting Xen for known vulnerabilities, or
>>> scanning for ROP/Spectre gadgets.
>>>
>>> Collect together the entrypoints in .text.entry (currently 3x4k frames, but
>>> can almost certainly be slimmed down), and create a common mapping which is
>>> inserted into each per-cpu shadow.  The stubs are also inserted into this
>>> mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
>>> boot, or CPU hotplug) to work without further changes to the common mappings.
>>>
>>> Signed-off-by: Andrew Cooper 
>>> ---
>>> CC: Jan Beulich 
>>> CC: Wei Liu 
>>> CC: Juergen Gross 
>>>
>>> RFC, because I don't think the stubs handling is particularly sensible.
>>>
>>> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
>>> together onto a single MFN.  The stubs ought to be isolated as well (as they
>>> leak the virtual addresses of each stack), which can be done by allocating an
>>> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
>>> can't use a common set of mappings, and will have to clone the single stub and
>>> .entry.text into each PCPUs copy of the pagetables.
>>>
>>> Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
>>> avoid any further pagetable allocations) has come a little unstuck because of
>>> CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
>>> rearrange the virtual layout for this plan to work.
>>> ---
>>>  xen/arch/x86/smpboot.c | 37 -
>>>  xen/arch/x86/x86_64/compat/entry.S |  2 ++
>>>  xen/arch/x86/x86_64/entry.S|  4 +++-
>>>  xen/arch/x86/xen.lds.S |  7 +++
>>>  4 files changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>>> index 2ebef03..2519141 100644
>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>>>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>>>  }
>>>  
>>> +/* Confirm that all stubs fit in a single L2 pagetable. */
>>> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
>> So we limit NR_CPUS to be max 512 now?
> 
> Not intentionally.  The PAGE_SIZE should be dropped.  (One full L2
> pagetable allows us to map 512*512 pages).

L2_PAGETABLE_SHIFT is 21. So I still don't get why dropping PAGE_SIZE
will correct it. OTOH I'm quite sure the BUILD_BUG_ON() won't trigger
any more with PAGE_SIZE being dropped. :-)

>> Maybe you should use STUB_BUF_SIZE instead of PAGE_SIZE?
> 
> No - that would be incorrect because of the physical packing of stubs
> which occurs.
> 
>> BTW: Is there any reason we don't use a common stub page mapped to each
>> per-cpu stack area? The stack address can easily be obtained via %rip
>> relative addressing then (see my patch for the per-vcpu stacks:
>> https://lists.xen.org/archives/html/xen-devel/2018-02/msg00773.html )
> 
> I don't understand what you are asking here.  We cannot access the
>> per-cpu area with plain rip-relative addressing without using gs base
> (and we really don't want to go down that route), or without per-cpu
> pagetables (which would have to be a compile time choice).

The stub-page of a cpu is currently mapped as the 3rd page of the
stack area. So the distance to the primary stack would be the same
for all cpus (a little bit less than 20kB).

> As for why the per-cpu areas aren't mapped, that's because they aren't
> needed at the moment.  Any decision to change this needs to weigh the
> utility of mapping the areas vs the additional data leakage, which is
> substantial.

The stack area is mapped. And that's where the stub is living.


Juergen


Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Andrew Cooper
On 14/02/18 07:54, Juergen Gross wrote:
> On 13/02/18 20:45, Andrew Cooper wrote:
>> The current XPTI implementation isolates the directmap (and therefore a lot of
>> guest data), but a large quantity of CPU0's state (including its stack)
>> remains visible.
>>
>> Furthermore, an attacker able to read .text is in a vastly superior position
>> to normal when it comes to fingerprinting Xen for known vulnerabilities, or
>> scanning for ROP/Spectre gadgets.
>>
>> Collect together the entrypoints in .text.entry (currently 3x4k frames, but
>> can almost certainly be slimmed down), and create a common mapping which is
>> inserted into each per-cpu shadow.  The stubs are also inserted into this
>> mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
>> boot, or CPU hotplug) to work without further changes to the common mappings.
>>
>> Signed-off-by: Andrew Cooper 
>> ---
>> CC: Jan Beulich 
>> CC: Wei Liu 
>> CC: Juergen Gross 
>>
>> RFC, because I don't think the stubs handling is particularly sensible.
>>
>> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
>> together onto a single MFN.  The stubs ought to be isolated as well (as they
>> leak the virtual addresses of each stack), which can be done by allocating an
>> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
>> can't use a common set of mappings, and will have to clone the single stub and
>> .entry.text into each PCPUs copy of the pagetables.
>>
>> Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
>> avoid any further pagetable allocations) has come a little unstuck because of
>> CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
>> rearrange the virtual layout for this plan to work.
>> ---
>>  xen/arch/x86/smpboot.c | 37 -
>>  xen/arch/x86/x86_64/compat/entry.S |  2 ++
>>  xen/arch/x86/x86_64/entry.S|  4 +++-
>>  xen/arch/x86/xen.lds.S |  7 +++
>>  4 files changed, 44 insertions(+), 6 deletions(-)
>>
>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>> index 2ebef03..2519141 100644
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>>  }
>>  
>> +/* Confirm that all stubs fit in a single L2 pagetable. */
>> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
> So we limit NR_CPUS to be max 512 now?

Not intentionally.  The PAGE_SIZE should be dropped.  (One full L2
pagetable allows us to map 512*512 pages).
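
The arithmetic behind this exchange, as a standalone sketch with stand-in
constants (Juergen's follow-up, quoted earlier in this digest, is that
merely dropping PAGE_SIZE changes the bound rather than producing the
intended one):

    #include <stdio.h>

    #define PAGE_SIZE             4096ul  /* 2^12 */
    #define L2_PAGETABLE_SHIFT    21      /* one L2 entry maps 2 MiB */
    #define L2_PAGETABLE_ENTRIES  512ul

    int main(void)
    {
        /* The check as posted trips beyond 2 MiB / 4 KiB CPUs: */
        printf("as posted:    %lu\n",
               (1ul << L2_PAGETABLE_SHIFT) / PAGE_SIZE);      /* 512 */

        /* One 4k stub VA per CPU under a full L2 table, i.e. the
         * 512*512 pages mentioned above: */
        printf("full L2:      %lu\n",
               (L2_PAGETABLE_ENTRIES << L2_PAGETABLE_SHIFT)
               / PAGE_SIZE);                                  /* 262144 */

        /* Dropping only PAGE_SIZE compares NR_CPUS against 2 MiB: */
        printf("no PAGE_SIZE: %lu\n",
               1ul << L2_PAGETABLE_SHIFT);                    /* 2097152 */
        return 0;
    }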

> Maybe you should use STUB_BUF_SIZE instead of PAGE_SIZE?

No - that would be incorrect because of the physical packing of stubs
which occurs.

> BTW: Is there any reason we don't use a common stub page mapped to each
> per-cpu stack area? The stack address can easily be obtained via %rip
> relative addressing then (see my patch for the per-vcpu stacks:
> https://lists.xen.org/archives/html/xen-devel/2018-02/msg00773.html )

I don't understand what you are asking here.  We cannot access the
per-cpu area with plain rip-relative addressing without using gs base
(and we really don't want to go down that route), or without per-cpu
pagetables (which would have to be a compile time choice).

As for why the per-cpu areas aren't mapped, that's because they aren't
needed at the moment.  Any decision to change this needs to weigh the
utility of mapping the areas vs the additional data leakage, which is
substantial.

~Andrew


Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-14 Thread Jan Beulich
>>> On 13.02.18 at 20:45,  wrote:
> RFC, because I don't think the stubs handling is particularly sensible.
> 
> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
> together onto a single MFN.  The stubs ought to be isolated as well (as they
> leak the virtual addresses of each stack), which can be done by allocating an
> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
> can't use a common set of mappings, and will have to clone the single stub and
> .entry.text into each PCPUs copy of the pagetables.

The 4k-per-CPU allocation of VA space is probably not strictly
necessary - quoting the original commit message:
"While sharing physical pages among certain CPUs on the same node, for
 now the virtual mappings get established in distinct pages for each
 CPU. This isn't a strict requirement, but simplifies VA space
 management for this initial implementation: Sharing VA space would
 require additional tracking of which areas are currently in use. If
 the VA and/or TLB overhead turned out to be a problem, such extra code
 could easily be added."

Without allocating a page per CPU I think what you do is sensible.
And I don't see how hiding stubs of other CPUs would help much -
an attacker can gather stack locations from various CPUs as its
vCPU is being moved around, and you can't hide the current CPU's
stub space.

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>  }
>  
> +/* Confirm that all stubs fit in a single L2 pagetable. */
> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));

Perhaps duplicate this in setup_cpu_root_pgt() (suitably adjusted
of course, as Jürgen has already pointed out)?

> @@ -651,9 +654,6 @@ static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
>  l2_pgentry_t *pl2e;
>  l1_pgentry_t *pl1e;
>  
> -if ( linear < DIRECTMAP_VIRT_START )
> -return 0;

Isn't outright removing this going a little too far?

> @@ -744,6 +744,9 @@ static __read_mostly int8_t opt_xpti = -1;
>  boolean_param("xpti", opt_xpti);
>  DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
>  
> +static root_pgentry_t common_pgt;

Move into setup_cpu_root_pgt()?

> +extern char _stextentry[], _etextentry[];

const?

> --- a/xen/arch/x86/x86_64/compat/entry.S
> +++ b/xen/arch/x86/x86_64/compat/entry.S
> @@ -13,6 +13,8 @@
>  #include 
>  #include 
>  
> +.section .text.entry, "ax", @progbits
> +
>  ENTRY(entry_int82)
>  ASM_CLAC
>  pushq $0

This also puts compat_create_bounce_frame into the entry section,
which surely isn't needed. Same for the non-compat variant.

> @@ -854,7 +856,7 @@ GLOBAL(autogen_entrypoints)
>  .popsection
>  .endm
>  
> -.text
> +.previous
>  autogen_stubs: /* Automatically generated stubs. */

Perhaps better to switch the earlier .section to .pushsection, and
use .popsection here?

Jan


Re: [Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-13 Thread Juergen Gross
On 13/02/18 20:45, Andrew Cooper wrote:
> The current XPTI implementation isolates the directmap (and therefore a lot of
> guest data), but a large quantity of CPU0's state (including its stack)
> remains visible.
> 
> Furthermore, an attacker able to read .text is in a vastly superior position
> to normal when it comes to fingerprinting Xen for known vulnerabilities, or
> scanning for ROP/Spectre gadgets.
> 
> Collect together the entrypoints in .text.entry (currently 3x4k frames, but
> can almost certainly be slimmed down), and create a common mapping which is
> inserted into each per-cpu shadow.  The stubs are also inserted into this
> mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
> boot, or CPU hotplug) to work without further changes to the common mappings.
> 
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Wei Liu 
> CC: Juergen Gross 
> 
> RFC, because I don't think the stubs handling is particularly sensible.
> 
> We allocate 4k of virtual address space per CPU, but squash loads of CPUs
> together onto a single MFN.  The stubs ought to be isolated as well (as they
> leak the virtual addresses of each stack), which can be done by allocating an
> MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
> can't use a common set of mappings, and will have to clone the single stub and
> .entry.text into each PCPUs copy of the pagetables.
> 
> Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
> avoid any further pagetable allocations) has come a little unstuck because of
> CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
> rearrange the virtual layout for this plan to work.
> ---
>  xen/arch/x86/smpboot.c | 37 -
>  xen/arch/x86/x86_64/compat/entry.S |  2 ++
>  xen/arch/x86/x86_64/entry.S|  4 +++-
>  xen/arch/x86/xen.lds.S |  7 +++
>  4 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index 2ebef03..2519141 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
>  unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
>  }
>  
> +/* Confirm that all stubs fit in a single L2 pagetable. */
> +BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));

So we limit NR_CPUS to be max 512 now?

Maybe you should use STUB_BUF_SIZE instead of PAGE_SIZE?

BTW: Is there any reason we don't use a common stub page mapped to each
per-cpu stack area? The stack address can easily be obtained via %rip
relative addressing then (see my patch for the per-vcpu stacks:
https://lists.xen.org/archives/html/xen-devel/2018-02/msg00773.html )


Juergen

> +
>  stub_va = XEN_VIRT_END - (cpu + 1) * PAGE_SIZE;
>  if ( map_pages_to_xen(stub_va, mfn_x(page_to_mfn(pg)), 1,
>PAGE_HYPERVISOR_RX | MAP_SMALL_PAGES) )
> @@ -651,9 +654,6 @@ static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
>  l2_pgentry_t *pl2e;
>  l1_pgentry_t *pl1e;
>  
> -if ( linear < DIRECTMAP_VIRT_START )
> -return 0;
> -
>  flags = l3e_get_flags(*pl3e);
>  ASSERT(flags & _PAGE_PRESENT);
>  if ( flags & _PAGE_PSE )
> @@ -744,6 +744,9 @@ static __read_mostly int8_t opt_xpti = -1;
>  boolean_param("xpti", opt_xpti);
>  DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
>  
> +static root_pgentry_t common_pgt;
> +extern char _stextentry[], _etextentry[];
> +
>  static int setup_cpu_root_pgt(unsigned int cpu)
>  {
>  root_pgentry_t *rpt;
> @@ -764,8 +767,32 @@ static int setup_cpu_root_pgt(unsigned int cpu)
>  idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
>  /* SH_LINEAR_PT inserted together with guest mappings. */
>  /* PERDOMAIN inserted during context switch. */
> -rpt[root_table_offset(XEN_VIRT_START)] =
> -idle_pg_table[root_table_offset(XEN_VIRT_START)];
> +
> +/* One-time setup of common_pgt, which maps .text.entry and the stubs. */
> +if ( unlikely(!root_get_intpte(common_pgt)) )
> +{
> +unsigned long stubs_linear = XEN_VIRT_END - 1;
> +l3_pgentry_t *stubs_main, *stubs_shadow;
> +char *ptr;
> +
> +for ( rc = 0, ptr = _stextentry;
> +  !rc && ptr < _etextentry; ptr += PAGE_SIZE )
> +rc = clone_mapping(ptr, rpt);
> +
> +if ( rc )
> +return rc;
> +
> +stubs_main = l4e_to_l3e(idle_pg_table[l4_table_offset(stubs_linear)]);
> +stubs_shadow = l4e_to_l3e(rpt[l4_table_offset(stubs_linear)]);
> +
> +/* Splice into the regular L2 mapping the stubs. */
> +stubs_shadow[l3_table_offset(stubs_linear)] =
> +stubs_main[l3_table_offset(stubs_linear)];
> +
> + 

[Xen-devel] [PATCH] x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings

2018-02-13 Thread Andrew Cooper
The current XPTI implementation isolates the directmap (and therefore a lot of
guest data), but a large quantity of CPU0's state (including its stack)
remains visible.

Furthermore, an attacker able to read .text is in a vastly superior position
to normal when it comes to fingerprinting Xen for known vulnerabilities, or
scanning for ROP/Spectre gadgets.

Collect together the entrypoints in .text.entry (currently 3x4k frames, but
can almost certainly be slimmed down), and create a common mapping which is
inserted into each per-cpu shadow.  The stubs are also inserted into this
mapping by pointing at the in-use L2.  This allows stubs allocated later (SMP
boot, or CPU hotplug) to work without further changes to the common mappings.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Wei Liu 
CC: Juergen Gross 

RFC, because I don't think the stubs handling is particularly sensible.

We allocate 4k of virtual address space per CPU, but squash loads of CPUs
together onto a single MFN.  The stubs ought to be isolated as well (as they
leak the virtual addresses of each stack), which can be done by allocating an
MFN per CPU (and simplifies cpu_smpboot_alloc() somewhat).  At this point, we
can't use a common set of mappings, and will have to clone the single stub and
.entry.text into each PCPUs copy of the pagetables.

Also, my plan to cause .text.entry to straddle a 512TB boundary (and therefore
avoid any further pagetable allocations) has come a little unstuck because of
CONFIG_BIGMEM.  I'm still working out whether there is a sensible way to
rearrange the virtual layout for this plan to work.
---
 xen/arch/x86/smpboot.c | 37 -
 xen/arch/x86/x86_64/compat/entry.S |  2 ++
 xen/arch/x86/x86_64/entry.S|  4 +++-
 xen/arch/x86/xen.lds.S |  7 +++
 4 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 2ebef03..2519141 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -622,6 +622,9 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
 unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
 }
 
+/* Confirm that all stubs fit in a single L2 pagetable. */
+BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L2_PAGETABLE_SHIFT));
+
 stub_va = XEN_VIRT_END - (cpu + 1) * PAGE_SIZE;
 if ( map_pages_to_xen(stub_va, mfn_x(page_to_mfn(pg)), 1,
   PAGE_HYPERVISOR_RX | MAP_SMALL_PAGES) )
@@ -651,9 +654,6 @@ static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
 l2_pgentry_t *pl2e;
 l1_pgentry_t *pl1e;
 
-if ( linear < DIRECTMAP_VIRT_START )
-return 0;
-
 flags = l3e_get_flags(*pl3e);
 ASSERT(flags & _PAGE_PRESENT);
 if ( flags & _PAGE_PSE )
@@ -744,6 +744,9 @@ static __read_mostly int8_t opt_xpti = -1;
 boolean_param("xpti", opt_xpti);
 DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
 
+static root_pgentry_t common_pgt;
+extern char _stextentry[], _etextentry[];
+
 static int setup_cpu_root_pgt(unsigned int cpu)
 {
 root_pgentry_t *rpt;
@@ -764,8 +767,32 @@ static int setup_cpu_root_pgt(unsigned int cpu)
 idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
 /* SH_LINEAR_PT inserted together with guest mappings. */
 /* PERDOMAIN inserted during context switch. */
-rpt[root_table_offset(XEN_VIRT_START)] =
-idle_pg_table[root_table_offset(XEN_VIRT_START)];
+
+/* One-time setup of common_pgt, which maps .text.entry and the stubs. */
+if ( unlikely(!root_get_intpte(common_pgt)) )
+{
+unsigned long stubs_linear = XEN_VIRT_END - 1;
+l3_pgentry_t *stubs_main, *stubs_shadow;
+char *ptr;
+
+for ( rc = 0, ptr = _stextentry;
+  !rc && ptr < _etextentry; ptr += PAGE_SIZE )
+rc = clone_mapping(ptr, rpt);
+
+if ( rc )
+return rc;
+
+stubs_main = l4e_to_l3e(idle_pg_table[l4_table_offset(stubs_linear)]);
+stubs_shadow = l4e_to_l3e(rpt[l4_table_offset(stubs_linear)]);
+
+/* Splice into the regular L2 mapping the stubs. */
+stubs_shadow[l3_table_offset(stubs_linear)] =
+stubs_main[l3_table_offset(stubs_linear)];
+
+common_pgt = rpt[root_table_offset(XEN_VIRT_START)];
+}
+
+rpt[root_table_offset(XEN_VIRT_START)] = common_pgt;
 
 /* Install direct map page table entries for stack, IDT, and TSS. */
 for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 707c746..b001e79 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -13,6 +13,8 @@
 #include 
 #include 
 
+.section .text.entry, "ax", @progbits
+
 ENTRY(entry_int82)
 ASM_CLAC
 pushq $0
diff --git