Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Jan Beulich
>>> On 23.01.18 at 18:33,  wrote:
> Well, at the very least there should be something in the boot scroll that
> says, "Enabling Xen Pagetable protection (XPTI) for PV guests" or
> something.  (That goes for the current round of XPTI as well really.)

And indeed I have this on my list of follow-up things, but didn't get
to it yet.

Jan



Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread George Dunlap
On 01/23/2018 04:56 PM, Juergen Gross wrote:
> On 23/01/18 17:45, George Dunlap wrote:
>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>> Juergen: you're now adding a LTR into the context switch path which
>>> tends to be very slow.  I.e. As currently presented, this series
>>> necessarily has a higher runtime overhead than Jan's XPTI.
>>
>> So here are a repeat of the "hypervisor compile" tests I did, comparing
>> the different XPTI-like series so far.
>>
>> # Experimental setup:
>> Host:
>>  - Intel(R) Xeon(R) CPU E5630  @ 2.53GHz
>>  - 4 pcpus
>>  - Memory: 4GiB
>> Guest:
>>  - 4vcpus, 512MiB, blkback to raw file
>>  - CentOS 6 userspace
>>  - Linux 4.14 kernel with PV / PVH / PVHVM / KVM guest support (along
>> with expected drivers) built-in
>> Test:
>>  - cd xen-4.10.0
>>  - make -C xen clean
>>  - time make -j 4 xen
>>
> 
> ...
> 
>> * Staging + Juergen's v2 series
>> real    1m3.018s
>> user    2m52.217s
>> sys     0m40.357s
>>
>> Result: 63s (0% overhead)
>>
>> Unfortunately, I can't really verify that Juergen's patches are having
>> any effect; there's no printk indicating whether it's enabling the
>> mitigation or not.  I have verified that the changeset reported in `xl
>> dmesg` corresponds to the branch I have with the patches applied.
>>
>> So it's *possible* something has gotten mixed up, and the mitigation
>> isn't being applied; but if it *is* applied, the performance is
>> significantly better than the "band-aid" XPTI.
> 
> As there is no real mitigation in place, but only the needed rework of
> the interrupt handling and context switching, anything not close to the
> xpti=off numbers would have been disappointing for me. :-)
> 
> I'll add some statistics in the next patches so it can be verified the
> patches are really doing something.

Well, at the very least there should be something in the boot scroll that
says, "Enabling Xen Pagetable protection (XPTI) for PV guests" or
something.  (That goes for the current round of XPTI as well really.)
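A boot-time message along these lines could be as simple as the following sketch (illustrative only: the option variable, its placement and the exact wording are assumptions, not what any of the posted patches actually do):

/* Sketch: report the XPTI state once during boot so it shows up in
 * `xl dmesg`.  opt_xpti is assumed to be the boolean command line
 * option controlling the mitigation. */
static bool __initdata opt_xpti = true;

static void __init xpti_report(void)
{
    printk("XPTI (Xen page table isolation): %s for 64-bit PV guests\n",
           opt_xpti ? "enabled" : "disabled");
}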

 -George


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Juergen Gross
On 23/01/18 17:45, George Dunlap wrote:
> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>> Juergen: you're now adding a LTR into the context switch path which
>> tends to be very slow.  I.e. As currently presented, this series
>> necessarily has a higher runtime overhead than Jan's XPTI.
> 
> So here are a repeat of the "hypervisor compile" tests I did, comparing
> the different XPTI-like series so far.
> 
> # Experimental setup:
> Host:
>  - Intel(R) Xeon(R) CPU E5630  @ 2.53GHz
>  - 4 pcpus
>  - Memory: 4GiB
> Guest:
>  - 4vcpus, 512MiB, blkback to raw file
>  - CentOS 6 userspace
>  - Linux 4.14 kernel with PV / PVH / PVHVM / KVM guest support (along
> with expected drivers) built-in
> Test:
>  - cd xen-4.10.0
>  - make -C xen clean
>  - time make -j 4 xen
> 

...

> * Staging + Juergen's v2 series
> real    1m3.018s
> user    2m52.217s
> sys     0m40.357s
> 
> Result: 63s (0% overhead)
> 
> Unfortunately, I can't really verify that Juergen's patches are having
> any effect; there's no printk indicating whether it's enabling the
> mitigation or not.  I have verified that the changeset reported in `xl
> dmesg` corresponds to the branch I have with the patches applied.
> 
> So it's *possible* something has gotten mixed up, and the mitigation
> isn't being applied; but if it *is* applied, the performance is
> significantly better than the "band-aid" XPTI.

As there is no real mitigation in place, but only the needed rework of
the interrupt handling and context switching, anything not close to the
xpti=off numbers would have been disappointing for me. :-)

I'll add some statistics in the next patches so it can be verified the
patches are really doing something.

Thanks for doing the tests,


Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread George Dunlap
On 01/22/2018 06:39 PM, Andrew Cooper wrote:
> On 22/01/18 16:51, Jan Beulich wrote:
> On 22.01.18 at 16:00,  wrote:
>>> On 22/01/18 15:48, Jan Beulich wrote:
>>> On 22.01.18 at 15:38,  wrote:
> On 22/01/18 15:22, Jan Beulich wrote:
> On 22.01.18 at 15:18,  wrote:
>>> On 22/01/18 13:50, Jan Beulich wrote:
>>> On 22.01.18 at 13:32,  wrote:
> As a preparation for doing page table isolation in the Xen hypervisor
> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
> 64 bit PV domains mapped to the per-domain virtual area.
>
> The per-vcpu stacks are used for early interrupt handling only. After
> saving the domain's registers stacks are switched back to the normal
> per physical cpu ones in order to be able to address on-stack data
> from other cpus e.g. while handling IPIs.
>
> Adding %cr3 switching between saving of the registers and switching
> the stacks will enable the possibility to run guest code without any
> per physical cpu mapping, i.e. avoiding the threat of a guest being
> able to access other domains data.
>
> Without any further measures it will still be possible for e.g. a
> guest's user program to read stack data of another vcpu of the same
> domain, but this can be easily avoided by a little PV-ABI modification
> introducing per-cpu user address spaces.
>
> This series is meant as a replacement for Andrew's patch series:
> "x86: Prerequisite work for a Xen KAISER solution".
 Considering in particular the two reverts, what I'm missing here
 is a clear description of the meaningful additional protection this
 approach provides over the band-aid. For context see also
 https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
  
>>> My approach supports mapping only the following data while the guest is
>>> running (apart from the guest's own data, of course):
>>>
>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>   guest's registers saved when an interrupt occurs
>>> - the per-vcpu GDTs and TSSs of the domain
>>> - the IDT
>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>
>>> All other hypervisor data and code can be completely hidden from the
>>> guests.
>> I understand that. What I'm not clear about is: Which parts of
>> the additionally hidden data are actually necessary (or at least
>> very desirable) to hide?
> Necessary:
> - other guests' memory (e.g. physical memory 1:1 mapping)
> - data from other guests e.g. in stack pages, debug buffers, I/O buffers,
>   code emulator buffers
> - other guests' register values e.g. in vcpu structure
 All of this is already being made invisible by the band-aid (with the
 exception of leftovers on the hypervisor stacks across context
 switches, which we've already said could be taken care of by
 memset()ing that area). I'm asking about the _additional_ benefits
 of your approach.
>>> I'm quite sure the performance will be much better as it doesn't require
>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>> guest L4 table, similar to the Linux kernel KPTI approach.
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> (Condensing a lot of threads down into one)
> 
> All the methods have L4 synchronisation update issues, until we have a
> PV ABI which guarantees that L4's don't get reused.  Any improvements to
> the shadowing/synchronisation algorithm will benefit all approaches.
> 
> Juergen: you're now adding a LTR into the context switch path which
> tends to be very slow.  I.e. As currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.

So here are a repeat of the "hypervisor compile" tests I did, comparing
the different XPTI-like series so far.

# Experimental setup:
Host:
 - Intel(R) Xeon(R) CPU E5630  @ 2.53GHz
 - 4 pcpus
 - Memory: 4GiB
Guest:
 - 4vcpus, 512MiB, blkback to raw file
 - CentOS 6 userspace
 - Linux 4.14 kernel with PV / PVH / PVHVM / KVM guest support (along
with expected drivers) built-in
Test:
 - cd xen-4.10.0
 - make -C xen clean
 - time make -j 4 xen

# Results
- In all cases, running a "default" build with CONFIG_DEBUG=n

* Staging, xpti=off
real    1m2.995s
user    2m52.527s
sys     0m40.276s

Result: 63s

* Staging [xpti default]
real    1m27.190s
user    3m3.900s
sys     1m42.686s

Result: 87s (38% overhead)

Note also that the "system time" here is about 2.5x of "xpti=off"; so
total wasted cpu time is significantly higher.

* Staging + "x86: slightly reduce Meltdown band-aid overhead"
real    1m21.661s
user

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Juergen Gross
On 23/01/18 12:45, Andrew Cooper wrote:
> On 23/01/18 10:10, Juergen Gross wrote:
>> On 23/01/18 10:31, Jan Beulich wrote:
>> On 23.01.18 at 10:24,  wrote:
 On 23/01/18 09:53, Jan Beulich wrote:
 On 23.01.18 at 07:34,  wrote:
>> On 22/01/18 19:39, Andrew Cooper wrote:
>>> One of my concerns is that this patch series moves further away from the
>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>> context switch code can drop a load of its slow instructions like LGDT
>>> and the VMWRITEs to update the VMCS.
>> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
>> see why the IDT can't be mapped to the same address on each cpu with
>> my approach.
> You're not introducing a per-CPU range in the page tables afaics
> (again from overview and titles only), yet with the IDT needing
> to be per-CPU you'd also need a per-CPU range to map it to if
> you want to avoid the LIDT as well as exposing what CPU you're
> on (same goes for the GDT and the respective avoidance of LGDT
> afaict).
 After a quick look I don't see why a Meltdown mitigation can't use
 the same IDT for all cpus: the only reason I could find for having
 per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
 And AMD won't need XPTI at all.
>>> Isn't your RFC series allowing XPTI to be enabled even on AMD?
>> Yes, you are right. This might either want to be revisited, or the
>> address space activated for SVM domains could map an IDT with the
>> IST-related traps removed.
> 
> I've experimented quite a lot in this area.  Ideally, we'd vmload/save
> in the SVM critical region (like all other hypervisors) at which point
> we don't need any adjustments to the IDT (as IST references are safe to
> use), and we'd catch stack overflows in the #DF handler rather than
> immediately triple faulting.
> 
> Using LIDT to switch between alternative IDTs, or INVLPG to swap the
> mapping under a fixed linear address are both much slower than the
> current implementation.
> 
>>
 The GDT of pv domains is already in the per-domain region even without
 my patches, so I don't have to change anything regarding usage of LGDT.
>>> Andrew's point was that eliminating the LGDT is a secondary goal.
>> With per-cpu mappings this is surely an obvious optimization. In the
>> end the overall performance should be taken as base for a decision.
>> His main point was avoiding exposing data like the physical cpu number
>> and this doesn't apply here, as the GDT is per vcpu in my case.
> 
> The GDT leaks vcpu_id into guest userspace, which is similarly problematic.

Mind explaining this? Why is leaking the vcpu_id problematic?

> The secondary goals of my KAISER series stand irrespective of the
> Meltdown issues:
> * The stack and mutable critical structures really should be numa-local
> to the CPU using it.
> * The GDT should sit fully fat over zeros.  At the moment in HVM
> context, there are 14 frames of arbitrary directmap living within the
> GDT limit.
> * The IDT/GDT should exist at the same linear address on every pcpu to
> avoid leaking information  (This property is what allows the removal of
> the lgdt from the context switch path).
> * The critical datastructures should be mapped read only to make
> exploitation harder for an attacker with a write-primitive.
> * With the stack at the same linear address on each CPU, we don't need
> the syscall stubs, and the TSS is identical on all cpus.
> 
> In some copious free time, it would be nice to fix these issues.

As long as you can't solve the primary performance problem of your
approach for existing PV guests, I don't see why the above tuning attempts
would make any sense.

I know for sure there are users out there who are unable to switch to HVM
or PVH guests because they need more than 64 vcpus per guest. So before
tackling the above problems you really have to solve the large-HVM-guest
problem. And making it impossible for those users to continue using
PV guests by hurting performance this badly won't be an accepted "solution".


Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Dario Faggioli

Hey, Hi!

On Mon, 2018-01-22 at 18:39 +, Andrew Cooper wrote:
> > > > On 22.01.18 at 15:38,  wrote:
> > > I'm quite sure the performance will be much better as it doesn't
> > > require
> > > per physical cpu L4 page tables, but just a shadow L4 table for
> > > each
> > > guest L4 table, similar to the Linux kernel KPTI approach.
> > 
> Juergen: you're now adding a LTR into the context switch path which
> tends to be very slow.  I.e. As currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.
> 
So, as Juergen mentioned, I'm trying to do some performance evaluation
of these solutions.

This is just the first set of numbers, so consider it preliminary. In
particular, I'm sure there is a better set of benchmarks than the ones
I've used for now (in order to have something quickly)... I am looking
more into this.

Anyway, what I'm seeing for now is that Juergen's branch performs
pretty much as current staging, if booted with xpti=false (i.e., with
Jan's band-aid compiled but disabled).

OTOH, staging with xpti=true does show some performance impact. I
appreciate that this is still an unfair comparison (as Juergen's series
lacks the "real XPTI" bits), but the goal here was to figure out
whether the current status of the series is already introducing
regressions or not (and, as far as this first set of benchmarks says, it's
not).

Anyway, here are the numbers. The benchmarks are run in a 16-vCPU Debian
PV guest, on a 16-pCPU (Intel Xeon) Debian host.

Raw numbers:
https://openbenchmarking.org/result/1801238-AL-1801232AL05

Normalized against "Staging xpti=false"
https://openbenchmarking.org/result/1801238-AL-1801232AL05_nor=y_hgv=4.11+Staging+xpti%3Dfalse

You'll have to forgive me about the labels (I'll pick better titles
next time). Their meaning is as follows:
- "4.11 Staging xpti=false": current staging, booted with
  xpti=false (so, with Jan's band-aid applied, but disabled);
- "staging-xpti-on": current staging, booted with xpti=true
  (so, with Jan's band-aid applied, but enabled);
- "4.11 Juergen xpti": Juergen's GitHub branch, booted with
  xpti=true.

I'll post more as soon as I will have it.
Dario
-- 
<> (Raistlin Majere)
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/



Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Andrew Cooper
On 23/01/18 10:10, Juergen Gross wrote:
> On 23/01/18 10:31, Jan Beulich wrote:
> On 23.01.18 at 10:24,  wrote:
>>> On 23/01/18 09:53, Jan Beulich wrote:
>>> On 23.01.18 at 07:34,  wrote:
> On 22/01/18 19:39, Andrew Cooper wrote:
>> One of my concerns is that this patch series moves further away from the
>> secondary goal of my KAISER series, which was to have the IDT and GDT
>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>> leak which CPU you're currently scheduled on into PV guests and b) the
>> context switch code can drop a load of its slow instructions like LGDT
>> and the VMWRITEs to update the VMCS.
> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
> see why the IDT can't be mapped to the same address on each cpu with
> my approach.
 You're not introducing a per-CPU range in the page tables afaics
 (again from overview and titles only), yet with the IDT needing
 to be per-CPU you'd also need a per-CPU range to map it to if
 you want to avoid the LIDT as well as exposing what CPU you're
 on (same goes for the GDT and the respective avoidance of LGDT
 afaict).
>>> After a quick look I don't see why a Meltdown mitigation can't use
>>> the same IDT for all cpus: the only reason I could find for having
>>> per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
>>> And AMD won't need XPTI at all.
>> Isn't your RFC series allowing XPTI to be enabled even on AMD?
> Yes, you are right. This might either want to be revisited or the
> address space to be activated for SVM domains could map an IDT with
> IST related traps removed.

I've experimented quite a lot in this area.  Ideally, we'd vmload/save
in the SVM critical region (like all other hypervisors) at which point
we don't need any adjustments to the IDT (as IST references are safe to
use), and we'd catch stack overflows in the #DF handler rather than
immediately triple faulting.

Using LIDT to switch between alternative IDTs, or INVLPG to swap the
mapping under a fixed linear address are both much slower than the
current implementation.

>
>>> The GDT of pv domains is already in the per-domain region even without
>>> my patches, so I don't have to change anything regarding usage of LGDT.
>> Andrew's point was that eliminating the LGDT is a secondary goal.
> With per-cpu mappings this is surely an obvious optimization. In the
> end the overall performance should be taken as base for a decision.
> His main point was avoiding exposing data like the physical cpu number
> and this doesn't apply here, as the GDT is per vcpu in my case.

The GDT leaks vcpu_id into guest userspace, which is similarly problematic.

The secondary goals of my KAISER series stand irrespective of the
Meltdown issues:
* The stack and mutable critical structures really should be numa-local
to the CPU using it.
* The GDT should sit fully fat over zeros.  At the moment in HVM
context, there are 14 frames of arbitrary directmap living within the
GDT limit.
* The IDT/GDT should exist at the same linear address on every pcpu to
avoid leaking information  (This property is what allows the removal of
the lgdt from the context switch path).
* The critical datastructures should be mapped read only to make
exploitation harder for an attacker with a write-primitive.
* With the stack at the same linear address on each CPU, we don't need
the syscall stubs, and the TSS is identical on all cpus.

In some copious free time, it would be nice to fix these issues.
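As an aside, the SIDT/SGDT leak referred to above is easy to picture: both instructions are unprivileged on x86 unless UMIP is enabled, so guest user space can read the descriptor table bases directly. A minimal, hypothetical user-space illustration (not Xen code, and not part of any of the series discussed here):

#include <stdint.h>
#include <stdio.h>

struct __attribute__((packed)) desc_ptr {
    uint16_t limit;
    uint64_t base;
};

int main(void)
{
    struct desc_ptr gdt, idt;

    /* Neither instruction faults in ring 3 on pre-UMIP hardware. */
    asm volatile ( "sgdt %0" : "=m" (gdt) );
    asm volatile ( "sidt %0" : "=m" (idt) );

    /* If these bases differ per physical CPU, they reveal which pCPU
     * this vcpu is currently scheduled on. */
    printf("GDT base %#llx limit %u, IDT base %#llx limit %u\n",
           (unsigned long long)gdt.base, (unsigned)gdt.limit,
           (unsigned long long)idt.base, (unsigned)idt.limit);
    return 0;
}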

~Andrew


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Andrew Cooper
On 23/01/18 08:36, Jan Beulich wrote:
 On 22.01.18 at 20:02,  wrote:
>> On 22/01/18 18:48, George Dunlap wrote:
>>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
 Jan: As to the things not covered by the current XPTI, hiding most of
 the .text section is important to prevent fingerprinting or ROP
 scanning.  This is a defence-in-depth argument, but a guest being easily
 able to identify whether certain XSAs are fixed or not is quite bad. 
>>> I'm afraid we have a fairly different opinion of what is "quite bad".
>> I suggest you try talking to some real users then.
>>
>>> Suppose we handed users a knob and said, "If you flip this switch,
>>> attackers won't be able to tell if you've fixed XSAs or not without
>>> trying them; but it will slow down your guests 20%."  How many do you
>>> think would flip it, and how many would reckon that an attacker could
>>> probably find out that information anyway?
>> Nonsense.  The performance hit is already taken.  The argument is "do
>> you want an attacker able to trivially evaluate security weaknesses in
>> your hypervisor", a process which usually has to be done by guesswork
>> and knowing the exact binary under attack.  Having .text fully readable
>> lowers the barrier to entry substantially.
> I neither agree with George's reply being nonsense, nor do I think
> this is an appropriate tone. _Some_ performance hit is already
> taken. Further hiding of information may incur further loss of
> performance, or are you telling me you can guarantee this never
> ever to happen? Additionally, the amount of "guesswork" may
> heavily depend on the nature of a specific issue. I can imagine
> cases where such guesswork may even turn out easier than using
> some side channel approach like those recent ones.
>
> As indicated earlier, I'm not fundamentally opposed to hiding
> more things, but I'm also not convinced we should hide more stuff
> regardless of the price to pay.

Here is an example which comes with zero extra overhead.

Shuffle the virtual layout to put .text adjacent to MMCFG, and steal
some space (1G?) from the top of MMCFG for .entry.text and the per-cpu
stubs.  With some linker adjustments, relative jumps/references will
even work properly.

Anyone serious about security is not going to be happy with XPTI in its
current form, because being able to arbitrarily read .text is far too
valuable for an attacker.  Anyone serious about performance will turn
the whole lot off.

In some theoretical world with three options, only a fool would choose
the middle option, because a 10% hit is not going to be chosen lightly
in the first place, but there is no point taking the hit with the
reduced security.

~Andrew


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread George Dunlap
On 01/22/2018 07:02 PM, Andrew Cooper wrote:
> On 22/01/18 18:48, George Dunlap wrote:
>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>> On 22/01/18 16:51, Jan Beulich wrote:
>>> On 22.01.18 at 16:00,  wrote:
> On 22/01/18 15:48, Jan Beulich wrote:
> On 22.01.18 at 15:38,  wrote:
>>> On 22/01/18 15:22, Jan Beulich wrote:
>>> On 22.01.18 at 15:18,  wrote:
> On 22/01/18 13:50, Jan Beulich wrote:
> On 22.01.18 at 13:32,  wrote:
>>> As a preparation for doing page table isolation in the Xen 
>>> hypervisor
>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS 
>>> for
>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>
>>> The per-vcpu stacks are used for early interrupt handling only. 
>>> After
>>> saving the domain's registers stacks are switched back to the normal
>>> per physical cpu ones in order to be able to address on-stack data
>>> from other cpus e.g. while handling IPIs.
>>>
>>> Adding %cr3 switching between saving of the registers and switching
>>> the stacks will enable the possibility to run guest code without any
>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>> able to access other domains data.
>>>
>>> Without any further measures it will still be possible for e.g. a
>>> guest's user program to read stack data of another vcpu of the same
>>> domain, but this can be easily avoided by a little PV-ABI 
>>> modification
>>> introducing per-cpu user address spaces.
>>>
>>> This series is meant as a replacement for Andrew's patch series:
>>> "x86: Prerequisite work for a Xen KAISER solution".
>> Considering in particular the two reverts, what I'm missing here
>> is a clear description of the meaningful additional protection this
>> approach provides over the band-aid. For context see also
>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>>  
> My approach supports mapping only the following data while the guest 
> is
> running (apart from the guest's own data, of course):
>
> - the per-vcpu entry stacks of the domain which will contain only the
>   guest's registers saved when an interrupt occurs
> - the per-vcpu GDTs and TSSs of the domain
> - the IDT
> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>
> All other hypervisor data and code can be completely hidden from the
> guests.
 I understand that. What I'm not clear about is: Which parts of
 the additionally hidden data are actually necessary (or at least
 very desirable) to hide?
>>> Necessary:
>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>> - data from other guests e.g. in stack pages, debug buffers, I/O buffers,
>>>   code emulator buffers
>>> - other guests' register values e.g. in vcpu structure
>> All of this is already being made invisible by the band-aid (with the
>> exception of leftovers on the hypervisor stacks across context
>> switches, which we've already said could be taken care of by
>> memset()ing that area). I'm asking about the _additional_ benefits
>> of your approach.
> I'm quite sure the performance will be much better as it doesn't require
> per physical cpu L4 page tables, but just a shadow L4 table for each
> guest L4 table, similar to the Linux kernel KPTI approach.
 But isn't that model having the same synchronization issues upon
 guest L4 updates which Andrew was fighting with?
>>> (Condensing a lot of threads down into one)
>>>
>>> All the methods have L4 synchronisation update issues, until we have a
>>> PV ABI which guarantees that L4's don't get reused.  Any improvements to
>>> the shadowing/synchronisation algorithm will benefit all approaches.
>>>
>>> Juergen: you're now adding a LTR into the context switch path which
>>> tends to be very slow.  I.e. As currently presented, this series
>>> necessarily has a higher runtime overhead than Jan's XPTI.
>>>
>>> One of my concerns is that this patch series moves further away from the
>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>> context switch code can drop a load of its slow instructions like LGDT
>>> and the VMWRITEs to update the VMCS.
>>>
>>> Jan: As to the things not covered by the current XPTI, hiding most of
>>> the .text section is important to prevent fingerprinting or ROP
>>> scanning.  This is a defence-in-depth argument, but a guest 

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Juergen Gross
On 23/01/18 10:31, Jan Beulich wrote:
 On 23.01.18 at 10:24,  wrote:
>> On 23/01/18 09:53, Jan Beulich wrote:
>> On 23.01.18 at 07:34,  wrote:
 On 22/01/18 19:39, Andrew Cooper wrote:
> One of my concerns is that this patch series moves further away from the
> secondary goal of my KAISER series, which was to have the IDT and GDT
> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
> leak which CPU you're currently scheduled on into PV guests and b) the
> context switch code can drop a load of its slow instructions like LGDT
> and the VMWRITEs to update the VMCS.

 The GDT address of a PV vcpu is depending on vcpu_id only. I don't
 see why the IDT can't be mapped to the same address on each cpu with
 my approach.
>>>
>>> You're not introducing a per-CPU range in the page tables afaics
>>> (again from overview and titles only), yet with the IDT needing
>>> to be per-CPU you'd also need a per-CPU range to map it to if
>>> you want to avoid the LIDT as well as exposing what CPU you're
>>> on (same goes for the GDT and the respective avoidance of LGDT
>>> afaict).
>>
>> After a quick look I don't see why a Meltdown mitigation can't use
>> the same IDT for all cpus: the only reason I could find for having
>> per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
>> And AMD won't need XPTI at all.
> 
> Isn't your RFC series allowing XPTI to be enabled even on AMD?

Yes, you are right. This might either want to be revisited, or the
address space activated for SVM domains could map an IDT with the
IST-related traps removed.

>> The GDT of pv domains is already in the per-domain region even without
>> my patches, so I don't have to change anything regarding usage of LGDT.
> 
> Andrew's point was that eliminating the LGDT is a secondary goal.

With per-cpu mappings this is surely an obvious optimization. In the
end the overall performance should be taken as base for a decision.
His main point was avoiding exposing data like the physical cpu number
and this doesn't apply here, as the GDT is per vcpu in my case.


Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Juergen Gross
On 23/01/18 09:40, Jan Beulich wrote:
 On 23.01.18 at 06:50,  wrote:
>> On 22/01/18 17:51, Jan Beulich wrote:
>>> But isn't that model having the same synchronization issues upon
>>> guest L4 updates which Andrew was fighting with?
>>
>> I don't think so, as the number of shadows will always only be max. 1
>> with my approach.
> 
> How can I know that? The overview mail doesn't talk about the
> intended shadowing algorithm afaics, and none of the patches
> (judging by their titles) implements any part thereof. In

Right. That's the reason I'm telling you about it.

> particular I'd be curious to know whether what you say will
> hold also for guests not making use of the intended PV ABI
> extension.

Those guests will still be vulnerable to Meltdown-style cross-vcpu
accesses to the Xen stacks. The Linux kernel is vulnerable in the same
way regarding its own stacks, so no new vulnerability is added for
Linux running as a PV guest (I have to admit I don't know whether the
same applies to BSD).


Juergen



Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Juergen Gross
On 23/01/18 09:53, Jan Beulich wrote:
 On 23.01.18 at 07:34,  wrote:
>> On 22/01/18 19:39, Andrew Cooper wrote:
>>> One of my concerns is that this patch series moves further away from the
>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>> context switch code can drop a load of its slow instructions like LGDT
>>> and the VMWRITEs to update the VMCS.
>>
>> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
>> see why the IDT can't be mapped to the same address on each cpu with
>> my approach.
> 
> You're not introducing a per-CPU range in the page tables afaics
> (again from overview and titles only), yet with the IDT needing
> to be per-CPU you'd also need a per-CPU range to map it to if
> you want to avoid the LIDT as well as exposing what CPU you're
> on (same goes for the GDT and the respective avoidance of LGDT
> afaict).

After a quick look I don't see why a Meltdown mitigation can't use
the same IDT for all cpus: the only reason I could find for having
per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
And AMD won't need XPTI at all.

The GDT of pv domains is already in the per-domain region even without
my patches, so I don't have to change anything regarding usage of LGDT.


Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Jan Beulich
>>> On 23.01.18 at 06:50,  wrote:
> On 22/01/18 17:51, Jan Beulich wrote:
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> I don't think so, as the number of shadows will always only be max. 1
> with my approach.

How can I know that? The overview mail doesn't talk about the
intended shadowing algorithm afaics, and none of the patches
(judging by their titles) implements any part thereof. In
particular I'd be curious to know whether what you say will
hold also for guests not making use of the intended PV ABI
extension.

Jan



Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-23 Thread Jan Beulich
>>> On 22.01.18 at 20:02,  wrote:
> On 22/01/18 18:48, George Dunlap wrote:
>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>> Jan: As to the things not covered by the current XPTI, hiding most of
>>> the .text section is important to prevent fingerprinting or ROP
>>> scanning.  This is a defence-in-depth argument, but a guest being easily
>>> able to identify whether certain XSAs are fixed or not is quite bad. 
>> I'm afraid we have a fairly different opinion of what is "quite bad".
> 
> I suggest you try talking to some real users then.
> 
>> Suppose we handed users a knob and said, "If you flip this switch,
>> attackers won't be able to tell if you've fixed XSAs or not without
>> trying them; but it will slow down your guests 20%."  How many do you
>> think would flip it, and how many would reckon that an attacker could
>> probably find out that information anyway?
> 
> Nonsense.  The performance hit is already taken.  The argument is "do
> you want an attacker able to trivially evaluate security weaknesses in
> your hypervisor", a process which usually has to be done by guesswork
> and knowing the exact binary under attack.  Having .text fully readable
> lowers the barrier to entry substantially.

I neither agree with George's reply being nonsense, nor do I think
this is an appropriate tone. _Some_ performance hit is already
taken. Further hiding of information may incur further loss of
performance, or are you telling me you can guarantee this never
ever to happen? Additionally, the amount of "guesswork" may
heavily depend on the nature of a specific issue. I can imagine
cases where such guesswork may even turn out easier than using
some side channel approach like those recent ones.

As indicated earlier, I'm not fundamentally opposed to hiding
more things, but I'm also not convinced we should hide more stuff
regardless of the price to pay.

Jan



Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Juergen Gross
On 23/01/18 07:34, Juergen Gross wrote:
> On 22/01/18 19:39, Andrew Cooper wrote:
>> On 22/01/18 16:51, Jan Beulich wrote:
>> On 22.01.18 at 16:00,  wrote:
 On 22/01/18 15:48, Jan Beulich wrote:
 On 22.01.18 at 15:38,  wrote:
>> On 22/01/18 15:22, Jan Beulich wrote:
>> On 22.01.18 at 15:18,  wrote:
 On 22/01/18 13:50, Jan Beulich wrote:
 On 22.01.18 at 13:32,  wrote:
>> As a preparation for doing page table isolation in the Xen hypervisor
>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>> 64 bit PV domains mapped to the per-domain virtual area.
>>
>> The per-vcpu stacks are used for early interrupt handling only. After
>> saving the domain's registers stacks are switched back to the normal
>> per physical cpu ones in order to be able to address on-stack data
>> from other cpus e.g. while handling IPIs.
>>
>> Adding %cr3 switching between saving of the registers and switching
>> the stacks will enable the possibility to run guest code without any
>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>> able to access other domains data.
>>
>> Without any further measures it will still be possible for e.g. a
>> guest's user program to read stack data of another vcpu of the same
>> domain, but this can be easily avoided by a little PV-ABI 
>> modification
>> introducing per-cpu user address spaces.
>>
>> This series is meant as a replacement for Andrew's patch series:
>> "x86: Prerequisite work for a Xen KAISER solution".
> Considering in particular the two reverts, what I'm missing here
> is a clear description of the meaningful additional protection this
> approach provides over the band-aid. For context see also
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>  
 My approach supports mapping only the following data while the guest is
 running (apart from the guest's own data, of course):

 - the per-vcpu entry stacks of the domain which will contain only the
   guest's registers saved when an interrupt occurs
 - the per-vcpu GDTs and TSSs of the domain
 - the IDT
 - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S

 All other hypervisor data and code can be completely hidden from the
 guests.
>>> I understand that. What I'm not clear about is: Which parts of
>>> the additionally hidden data are actually necessary (or at least
>>> very desirable) to hide?
>> Necessary:
>> - other guests' memory (e.g. physical memory 1:1 mapping)
>> - data from other guests e.g. in stack pages, debug buffers, I/O buffers,
>>   code emulator buffers
>> - other guests' register values e.g. in vcpu structure
> All of this is already being made invisible by the band-aid (with the
> exception of leftovers on the hypervisor stacks across context
> switches, which we've already said could be taken care of by
> memset()ing that area). I'm asking about the _additional_ benefits
> of your approach.
 I'm quite sure the performance will be much better as it doesn't require
 per physical cpu L4 page tables, but just a shadow L4 table for each
 guest L4 table, similar to the Linux kernel KPTI approach.
>>> But isn't that model having the same synchronization issues upon
>>> guest L4 updates which Andrew was fighting with?
>>
>> (Condensing a lot of threads down into one)
>>
>> All the methods have L4 synchronisation update issues, until we have a
>> PV ABI which guarantees that L4's don't get reused.  Any improvements to
>> the shadowing/synchronisation algorithm will benefit all approaches.
>>
>> Juergen: you're now adding a LTR into the context switch path which
>> tends to be very slow.  I.e. As currently presented, this series
>> necessarily has a higher runtime overhead than Jan's XPTI.
> 
> Sure? How slow is LTR compared to a copy of nearly 4kB of data?

I just added some measurement code to ltr(). On my system ltr takes
about 320 cycles, so a little bit more than 100ns (2.9 GHz).

With 10,000 context switches per second and 2 ltr instructions per
context switch this would add up to about 0.2% performance loss.
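For what it's worth, that estimate checks out with simple arithmetic; a standalone sketch using the figures quoted above (320 cycles, 2.9 GHz, 10,000 context switches per second, two LTRs per switch -- the numbers are from the mail, not new measurements):

#include <stdio.h>

int main(void)
{
    const double cycles_per_ltr   = 320.0;
    const double cpu_hz           = 2.9e9;
    const double switches_per_sec = 10000.0;
    const double ltrs_per_switch  = 2.0;

    double ltr_seconds  = cycles_per_ltr / cpu_hz;                  /* ~110 ns */
    double lost_per_sec = switches_per_sec * ltrs_per_switch * ltr_seconds;

    printf("LTR: ~%.0f ns each; overhead: ~%.2f%% of one CPU\n",
           ltr_seconds * 1e9, lost_per_sec * 100.0);
    return 0;
}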


Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Juergen Gross
On 22/01/18 22:45, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 22, 2018 at 01:32:44PM +0100, Juergen Gross wrote:
>> As a preparation for doing page table isolation in the Xen hypervisor
>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>> 64 bit PV domains mapped to the per-domain virtual area.
>>
>> The per-vcpu stacks are used for early interrupt handling only. After
>> saving the domain's registers stacks are switched back to the normal
>> per physical cpu ones in order to be able to address on-stack data
>> from other cpus e.g. while handling IPIs.
>>
>> Adding %cr3 switching between saving of the registers and switching
>> the stacks will enable the possibility to run guest code without any
>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>> able to access other domains data.
>>
>> Without any further measures it will still be possible for e.g. a
>> guest's user program to read stack data of another vcpu of the same
>> domain, but this can be easily avoided by a little PV-ABI modification
>> introducing per-cpu user address spaces.
>>
>> This series is meant as a replacement for Andrew's patch series:
>> "x86: Prerequisite work for a Xen KAISER solution".
>>
>> What needs to be done:
>> - verify livepatching is still working
> 
> Is there an git repo for this?

https://github.com/jgross1/xen.git xpti


Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Juergen Gross
On 22/01/18 19:39, Andrew Cooper wrote:
> On 22/01/18 16:51, Jan Beulich wrote:
> On 22.01.18 at 16:00,  wrote:
>>> On 22/01/18 15:48, Jan Beulich wrote:
>>> On 22.01.18 at 15:38,  wrote:
> On 22/01/18 15:22, Jan Beulich wrote:
> On 22.01.18 at 15:18,  wrote:
>>> On 22/01/18 13:50, Jan Beulich wrote:
>>> On 22.01.18 at 13:32,  wrote:
> As a preparation for doing page table isolation in the Xen hypervisor
> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
> 64 bit PV domains mapped to the per-domain virtual area.
>
> The per-vcpu stacks are used for early interrupt handling only. After
> saving the domain's registers stacks are switched back to the normal
> per physical cpu ones in order to be able to address on-stack data
> from other cpus e.g. while handling IPIs.
>
> Adding %cr3 switching between saving of the registers and switching
> the stacks will enable the possibility to run guest code without any
> per physical cpu mapping, i.e. avoiding the threat of a guest being
> able to access other domains data.
>
> Without any further measures it will still be possible for e.g. a
> guest's user program to read stack data of another vcpu of the same
> domain, but this can be easily avoided by a little PV-ABI modification
> introducing per-cpu user address spaces.
>
> This series is meant as a replacement for Andrew's patch series:
> "x86: Prerequisite work for a Xen KAISER solution".
 Considering in particular the two reverts, what I'm missing here
 is a clear description of the meaningful additional protection this
 approach provides over the band-aid. For context see also
 https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
  
>>> My approach supports mapping only the following data while the guest is
>>> running (apart from the guest's own data, of course):
>>>
>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>   guest's registers saved when an interrupt occurs
>>> - the per-vcpu GDTs and TSSs of the domain
>>> - the IDT
>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>
>>> All other hypervisor data and code can be completely hidden from the
>>> guests.
>> I understand that. What I'm not clear about is: Which parts of
>> the additionally hidden data are actually necessary (or at least
>> very desirable) to hide?
> Necessary:
> - other guests' memory (e.g. physical memory 1:1 mapping)
> - data from other guests e.g. in stack pages, debug buffers, I/O buffers,
>   code emulator buffers
> - other guests' register values e.g. in vcpu structure
 All of this is already being made invisible by the band-aid (with the
 exception of leftovers on the hypervisor stacks across context
 switches, which we've already said could be taken care of by
 memset()ing that area). I'm asking about the _additional_ benefits
 of your approach.
>>> I'm quite sure the performance will be much better as it doesn't require
>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>> guest L4 table, similar to the Linux kernel KPTI approach.
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> (Condensing a lot of threads down into one)
> 
> All the methods have L4 synchronisation update issues, until we have a
> PV ABI which guarantees that L4's don't get reused.  Any improvements to
> the shadowing/synchronisation algorithm will benefit all approaches.
> 
> Juergen: you're now adding a LTR into the context switch path which
> tends to be very slow.  I.e. As currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.

Sure? How slow is LTR compared to a copy of nearly 4kB of data?

> One of my concerns is that this patch series moves further away from the
> secondary goal of my KAISER series, which was to have the IDT and GDT
> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
> leak which CPU you're currently scheduled on into PV guests and b) the
> context switch code can drop a load of its slow instructions like LGDT
> and the VMWRITEs to update the VMCS.

The GDT address of a PV vcpu is depending on vcpu_id only. I don't
see why the IDT can't be mapped to the same address on each cpu with
my approach.


Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Juergen Gross
On 22/01/18 17:51, Jan Beulich wrote:
 On 22.01.18 at 16:00,  wrote:
>> On 22/01/18 15:48, Jan Beulich wrote:
>> On 22.01.18 at 15:38,  wrote:
 On 22/01/18 15:22, Jan Beulich wrote:
 On 22.01.18 at 15:18,  wrote:
>> On 22/01/18 13:50, Jan Beulich wrote:
>> On 22.01.18 at 13:32,  wrote:
 As a preparation for doing page table isolation in the Xen hypervisor
 in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
 64 bit PV domains mapped to the per-domain virtual area.

 The per-vcpu stacks are used for early interrupt handling only. After
 saving the domain's registers stacks are switched back to the normal
 per physical cpu ones in order to be able to address on-stack data
 from other cpus e.g. while handling IPIs.

 Adding %cr3 switching between saving of the registers and switching
 the stacks will enable the possibility to run guest code without any
 per physical cpu mapping, i.e. avoiding the threat of a guest being
 able to access other domains data.

 Without any further measures it will still be possible for e.g. a
 guest's user program to read stack data of another vcpu of the same
 domain, but this can be easily avoided by a little PV-ABI modification
 introducing per-cpu user address spaces.

 This series is meant as a replacement for Andrew's patch series:
 "x86: Prerequisite work for a Xen KAISER solution".
>>>
>>> Considering in particular the two reverts, what I'm missing here
>>> is a clear description of the meaningful additional protection this
>>> approach provides over the band-aid. For context see also
>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>>>  
>>
>> My approach supports mapping only the following data while the guest is
>> running (apart from the guest's own data, of course):
>>
>> - the per-vcpu entry stacks of the domain which will contain only the
>>   guest's registers saved when an interrupt occurs
>> - the per-vcpu GDTs and TSSs of the domain
>> - the IDT
>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>
>> All other hypervisor data and code can be completely hidden from the
>> guests.
>
> I understand that. What I'm not clear about is: Which parts of
> the additionally hidden data are actually necessary (or at least
> very desirable) to hide?

 Necessary:
 - other guests' memory (e.g. physical memory 1:1 mapping)
 - data from other guests e.g. in stack pages, debug buffers, I/O buffers,
   code emulator buffers
 - other guests' register values e.g. in vcpu structure
>>>
>>> All of this is already being made invisible by the band-aid (with the
>>> exception of leftovers on the hypervisor stacks across context
>>> switches, which we've already said could be taken care of by
>>> memset()ing that area). I'm asking about the _additional_ benefits
>>> of your approach.
>>
>> I'm quite sure the performance will be much better as it doesn't require
>> per physical cpu L4 page tables, but just a shadow L4 table for each
>> guest L4 table, similar to the Linux kernel KPTI approach.
> 
> But isn't that model having the same synchronization issues upon
> guest L4 updates which Andrew was fighting with?

I don't think so, as the number of shadows will always only be max. 1
with my approach.
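For reference, Jan's "memset()ing that area" suggestion quoted above boils down to something like the following hedged sketch; the helper name and the way the stack bounds are passed in are illustrative assumptions, not actual Xen code:

#include <string.h>

/* The stack grows downwards, so everything below the current stack
 * pointer, down to the lowest usable address of the per-pCPU stack,
 * may still hold leftovers from work done on behalf of the previously
 * running vcpu.  Wiping that range before running another domain's
 * vcpu removes the residue that would otherwise stay mapped. */
static inline void scrub_stale_stack(void *stack_lowest_usable, void *current_sp)
{
    memset(stack_lowest_usable, 0,
           (char *)current_sp - (char *)stack_lowest_usable);
}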

Juergen


Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Andrew Cooper
On 22/01/18 18:48, George Dunlap wrote:
> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>> On 22/01/18 16:51, Jan Beulich wrote:
>> On 22.01.18 at 16:00,  wrote:
 On 22/01/18 15:48, Jan Beulich wrote:
 On 22.01.18 at 15:38,  wrote:
>> On 22/01/18 15:22, Jan Beulich wrote:
>> On 22.01.18 at 15:18,  wrote:
 On 22/01/18 13:50, Jan Beulich wrote:
 On 22.01.18 at 13:32,  wrote:
>> As a preparation for doing page table isolation in the Xen hypervisor
>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>> 64 bit PV domains mapped to the per-domain virtual area.
>>
>> The per-vcpu stacks are used for early interrupt handling only. After
>> saving the domain's registers stacks are switched back to the normal
>> per physical cpu ones in order to be able to address on-stack data
>> from other cpus e.g. while handling IPIs.
>>
>> Adding %cr3 switching between saving of the registers and switching
>> the stacks will enable the possibility to run guest code without any
>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>> able to access other domains data.
>>
>> Without any further measures it will still be possible for e.g. a
>> guest's user program to read stack data of another vcpu of the same
>> domain, but this can be easily avoided by a little PV-ABI 
>> modification
>> introducing per-cpu user address spaces.
>>
>> This series is meant as a replacement for Andrew's patch series:
>> "x86: Prerequisite work for a Xen KAISER solution".
> Considering in particular the two reverts, what I'm missing here
> is a clear description of the meaningful additional protection this
> approach provides over the band-aid. For context see also
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>  
 My approach supports mapping only the following data while the guest is
 running (apart from the guest's own data, of course):

 - the per-vcpu entry stacks of the domain which will contain only the
   guest's registers saved when an interrupt occurs
 - the per-vcpu GDTs and TSSs of the domain
 - the IDT
 - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S

 All other hypervisor data and code can be completely hidden from the
 guests.
>>> I understand that. What I'm not clear about is: Which parts of
>>> the additionally hidden data are actually necessary (or at least
>>> very desirable) to hide?
>> Necessary:
>> - other guests' memory (e.g. physical memory 1:1 mapping)
>> - data from other guests e.g. in stack pages, debug buffers, I/O buffers,
>>   code emulator buffers
>> - other guests' register values e.g. in vcpu structure
> All of this is already being made invisible by the band-aid (with the
> exception of leftovers on the hypervisor stacks across context
> switches, which we've already said could be taken care of by
> memset()ing that area). I'm asking about the _additional_ benefits
> of your approach.
 I'm quite sure the performance will be much better as it doesn't require
 per physical cpu L4 page tables, but just a shadow L4 table for each
 guest L4 table, similar to the Linux kernel KPTI approach.
>>> But isn't that model having the same synchronization issues upon
>>> guest L4 updates which Andrew was fighting with?
>> (Condensing a lot of threads down into one)
>>
>> All the methods have L4 synchronisation update issues, until we have a
>> PV ABI which guarantees that L4's don't get reused.  Any improvements to
>> the shadowing/synchronisation algorithm will benefit all approaches.
>>
>> Juergen: you're now adding a LTR into the context switch path which
>> tends to be very slow.  I.e. As currently presented, this series
>> necessarily has a higher runtime overhead than Jan's XPTI.
>>
>> One of my concerns is that this patch series moves further away from the
>> secondary goal of my KAISER series, which was to have the IDT and GDT
>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>> leak which CPU you're currently scheduled on into PV guests and b) the
>> context switch code can drop a load of its slow instructions like LGDT
>> and the VMWRITEs to update the VMCS.
>>
>> Jan: As to the things not covered by the current XPTI, hiding most of
>> the .text section is important to prevent fingerprinting or ROP
>> scanning.  This is a defence-in-depth argument, but a guest being easily
>> able to identify whether certain XSAs are fixed or not is quite bad. 
> I'm afraid we have a fairly different opinion of what is "quite bad".

I suggest you try 

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread George Dunlap
On 01/22/2018 06:39 PM, Andrew Cooper wrote:
> On 22/01/18 16:51, Jan Beulich wrote:
> On 22.01.18 at 16:00,  wrote:
>>> On 22/01/18 15:48, Jan Beulich wrote:
>>> On 22.01.18 at 15:38,  wrote:
> On 22/01/18 15:22, Jan Beulich wrote:
> On 22.01.18 at 15:18,  wrote:
>>> On 22/01/18 13:50, Jan Beulich wrote:
>>> On 22.01.18 at 13:32,  wrote:
> As a preparation for doing page table isolation in the Xen hypervisor
> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
> 64 bit PV domains mapped to the per-domain virtual area.
>
> The per-vcpu stacks are used for early interrupt handling only. After
> saving the domain's registers stacks are switched back to the normal
> per physical cpu ones in order to be able to address on-stack data
> from other cpus e.g. while handling IPIs.
>
> Adding %cr3 switching between saving of the registers and switching
> the stacks will enable the possibility to run guest code without any
> per physical cpu mapping, i.e. avoiding the threat of a guest being
> able to access other domains data.
>
> Without any further measures it will still be possible for e.g. a
> guest's user program to read stack data of another vcpu of the same
> domain, but this can be easily avoided by a little PV-ABI modification
> introducing per-cpu user address spaces.
>
> This series is meant as a replacement for Andrew's patch series:
> "x86: Prerequisite work for a Xen KAISER solution".
 Considering in particular the two reverts, what I'm missing here
 is a clear description of the meaningful additional protection this
 approach provides over the band-aid. For context see also
 https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
  
>>> My approach supports mapping only the following data while the guest is
>>> running (apart from the guest's own data, of course):
>>>
>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>   guest's registers saved when an interrupt occurs
>>> - the per-vcpu GDTs and TSSs of the domain
>>> - the IDT
>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)
>>>
>>> All other hypervisor data and code can be completely hidden from the
>>> guests.
>> I understand that. What I'm not clear about is: Which parts of
>> the additionally hidden data are actually necessary (or at least
>> very desirable) to hide?
> Necessary:
> - other guests' memory (e.g. physical memory 1:1 mapping)
> - data from other guests, e.g. in stack pages, debug buffers, I/O buffers,
>   code emulator buffers
> - other guests' register values e.g. in vcpu structure
 All of this is already being made invisible by the band-aid (with the
 exception of leftovers on the hypervisor stacks across context
 switches, which we've already said could be taken care of by
 memset()ing that area). I'm asking about the _additional_ benefits
 of your approach.
>>> I'm quite sure the performance will be much better as it doesn't require
>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>> guest L4 table, similar to the Linux kernel KPTI approach.
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> (Condensing a lot of threads down into one)
> 
> All the methods have L4 synchronisation update issues, until we have a
> PV ABI which guarantees that L4's don't get reused.  Any improvements to
> the shadowing/synchronisation algorithm will benefit all approaches.
> 
> Juergen: you're now adding an LTR into the context switch path, which
> tends to be very slow.  I.e., as currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.
> 
> One of my concerns is that this patch series moves further away from the
> secondary goal of my KAISER series, which was to have the IDT and GDT
> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
> leak which CPU you're currently scheduled on into PV guests and b) the
> context switch code can drop a load of its slow instructions like LGDT
> and the VMWRITEs to update the VMCS.
> 
> Jan: As to the things not covered by the current XPTI, hiding most of
> the .text section is important to prevent fingerprinting or ROP
> scanning.  This is a defence-in-depth argument, but a guest being easily
> able to identify whether certain XSAs are fixed or not is quite bad. 

I'm afraid we have a fairly different opinion of what is "quite bad".
Suppose we handed users a knob and said, "If you flip this switch,
attackers won't be able to tell if you've fixed XSAs or not without
trying them; but it 

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Andrew Cooper
On 22/01/18 16:51, Jan Beulich wrote:
 On 22.01.18 at 16:00,  wrote:
>> On 22/01/18 15:48, Jan Beulich wrote:
>> On 22.01.18 at 15:38,  wrote:
 On 22/01/18 15:22, Jan Beulich wrote:
 On 22.01.18 at 15:18,  wrote:
>> On 22/01/18 13:50, Jan Beulich wrote:
>> On 22.01.18 at 13:32,  wrote:
 As a preparation for doing page table isolation in the Xen hypervisor
 in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
 64 bit PV domains mapped to the per-domain virtual area.

 The per-vcpu stacks are used for early interrupt handling only. After
 saving the domain's registers, stacks are switched back to the normal
 per physical cpu ones in order to be able to address on-stack data
 from other cpus e.g. while handling IPIs.

 Adding %cr3 switching between saving of the registers and switching
 the stacks will enable the possibility to run guest code without any
 per physical cpu mapping, i.e. avoiding the threat of a guest being
 able to access other domains' data.

 Without any further measures it will still be possible for e.g. a
 guest's user program to read stack data of another vcpu of the same
 domain, but this can be easily avoided by a little PV-ABI modification
 introducing per-cpu user address spaces.

 This series is meant as a replacement for Andrew's patch series:
 "x86: Prerequisite work for a Xen KAISER solution".
>>> Considering in particular the two reverts, what I'm missing here
>>> is a clear description of the meaningful additional protection this
>>> approach provides over the band-aid. For context see also
>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>>>  
>> My approach supports mapping only the following data while the guest is
>> running (apart from the guest's own data, of course):
>>
>> - the per-vcpu entry stacks of the domain which will contain only the
>>   guest's registers saved when an interrupt occurs
>> - the per-vcpu GDTs and TSSs of the domain
>> - the IDT
>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)
>>
>> All other hypervisor data and code can be completely hidden from the
>> guests.
> I understand that. What I'm not clear about is: Which parts of
> the additionally hidden data are actually necessary (or at least
> very desirable) to hide?
 Necessary:
 - other guests' memory (e.g. physical memory 1:1 mapping)
 - data from other guests, e.g. in stack pages, debug buffers, I/O buffers,
   code emulator buffers
 - other guests' register values e.g. in vcpu structure
>>> All of this is already being made invisible by the band-aid (with the
>>> exception of leftovers on the hypervisor stacks across context
>>> switches, which we've already said could be taken care of by
>>> memset()ing that area). I'm asking about the _additional_ benefits
>>> of your approach.
>> I'm quite sure the performance will be much better as it doesn't require
>> per physical cpu L4 page tables, but just a shadow L4 table for each
>> guest L4 table, similar to the Linux kernel KPTI approach.
> But isn't that model having the same synchronization issues upon
> guest L4 updates which Andrew was fighting with?

(Condensing a lot of threads down into one)

All the methods have L4 synchronisation update issues, until we have a
PV ABI which guarantees that L4's don't get reused.  Any improvements to
the shadowing/synchronisation algorithm will benefit all approaches.

Juergen: you're now adding an LTR into the context switch path, which
tends to be very slow.  I.e., as currently presented, this series
necessarily has a higher runtime overhead than Jan's XPTI.
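
For reference, a minimal sketch of what a TR reload involves (illustrative
only, not the actual Xen code; the GDT handling is simplified and
reload_tr() is a made-up name):

#include <stdint.h>

/* Reloading the task register on a context switch.  The TSS descriptor
 * must be marked "available" (busy bit, bit 41 of the descriptor,
 * cleared) before LTR will accept it, and LTR itself is a serialising
 * instruction, which is part of why it is comparatively slow. */
static inline void reload_tr(uint64_t *gdt, unsigned int tss_sel)
{
    gdt[tss_sel >> 3] &= ~(1ULL << 41);      /* clear the busy bit */
    asm volatile ("ltr %w0" :: "rm" ((uint16_t)tss_sel) : "memory");
}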

One of my concerns is that this patch series moves further away from the
secondary goal of my KAISER series, which was to have the IDT and GDT
mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
leak which CPU you're currently scheduled on into PV guests and b) the
context switch code can drop a load of its slow instructions like LGDT
and the VMWRITEs to update the VMCS.
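
To illustrate the SIDT point: SIDT/SGDT are not privileged (in the absence
of UMIP), so a rough sketch of what guest user space can already observe
looks like this:

#include <stdint.h>
#include <stdio.h>

struct __attribute__((packed)) desc_ptr {
    uint16_t limit;
    uint64_t base;
};

int main(void)
{
    struct desc_ptr idtr;

    /* Read the linear address of the IDT currently in use; if each CPU
     * has its IDT at a different address, this reveals which pCPU the
     * vCPU is scheduled on. */
    asm volatile ("sidt %0" : "=m" (idtr));
    printf("IDT base: %#llx\n", (unsigned long long)idtr.base);
    return 0;
}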

Jan: As to the things not covered by the current XPTI, hiding most of
the .text section is important to prevent fingerprinting or ROP
scanning.  This is a defence-in-depth argument, but a guest being easily
able to identify whether certain XSAs are fixed or not is quite bad. 
Also, a load of CPU 0's data structures, including its stack, are
visible in .data.
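
To make that concrete: the boot CPU's stack is a statically allocated
array, so its address is fixed at link time for anyone able to read the
image's data/bss sections (declaration merely modelled on the setup code;
STACK_SIZE assumed to be the usual 8 pages):

#define STACK_SIZE (8 * 4096)   /* assumed */

unsigned char __attribute__((__section__(".bss.stack_aligned"),
                             __aligned__(STACK_SIZE)))
    cpu0_stack[STACK_SIZE];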

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Jan Beulich
>>> On 22.01.18 at 16:00,  wrote:
> On 22/01/18 15:48, Jan Beulich wrote:
> On 22.01.18 at 15:38,  wrote:
>>> On 22/01/18 15:22, Jan Beulich wrote:
>>> On 22.01.18 at 15:18,  wrote:
> On 22/01/18 13:50, Jan Beulich wrote:
> On 22.01.18 at 13:32,  wrote:
>>> As a preparation for doing page table isolation in the Xen hypervisor
>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>
>>> The per-vcpu stacks are used for early interrupt handling only. After
>>> saving the domain's registers, stacks are switched back to the normal
>>> per physical cpu ones in order to be able to address on-stack data
>>> from other cpus e.g. while handling IPIs.
>>>
>>> Adding %cr3 switching between saving of the registers and switching
>>> the stacks will enable the possibility to run guest code without any
>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>> able to access other domains' data.
>>>
>>> Without any further measures it will still be possible for e.g. a
>>> guest's user program to read stack data of another vcpu of the same
>>> domain, but this can be easily avoided by a little PV-ABI modification
>>> introducing per-cpu user address spaces.
>>>
>>> This series is meant as a replacement for Andrew's patch series:
>>> "x86: Prerequisite work for a Xen KAISER solution".
>>
>> Considering in particular the two reverts, what I'm missing here
>> is a clear description of the meaningful additional protection this
>> approach provides over the band-aid. For context see also
>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>>  
>
> My approach supports mapping only the following data while the guest is
> running (apart from the guest's own data, of course):
>
> - the per-vcpu entry stacks of the domain which will contain only the
>   guest's registers saved when an interrupt occurs
> - the per-vcpu GDTs and TSSs of the domain
> - the IDT
> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)
>
> All other hypervisor data and code can be completely hidden from the
> guests.

 I understand that. What I'm not clear about is: Which parts of
 the additionally hidden data are actually necessary (or at least
 very desirable) to hide?
>>>
>>> Necessary:
>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>> - data from other guests, e.g. in stack pages, debug buffers, I/O buffers,
>>>   code emulator buffers
>>> - other guests' register values e.g. in vcpu structure
>> 
>> All of this is already being made invisible by the band-aid (with the
>> exception of leftovers on the hypervisor stacks across context
>> switches, which we've already said could be taken care of by
>> memset()ing that area). I'm asking about the _additional_ benefits
>> of your approach.
> 
> I'm quite sure the performance will be much better as it doesn't require
> per physical cpu L4 page tables, but just a shadow L4 table for each
> guest L4 table, similar to the Linux kernel KPTI approach.

But isn't that model having the same synchronization issues upon
guest L4 updates which Andrew was fighting with?

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Juergen Gross
On 22/01/18 15:48, Jan Beulich wrote:
 On 22.01.18 at 15:38,  wrote:
>> On 22/01/18 15:22, Jan Beulich wrote:
>> On 22.01.18 at 15:18,  wrote:
 On 22/01/18 13:50, Jan Beulich wrote:
 On 22.01.18 at 13:32,  wrote:
>> As a preparation for doing page table isolation in the Xen hypervisor
>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>> 64 bit PV domains mapped to the per-domain virtual area.
>>
>> The per-vcpu stacks are used for early interrupt handling only. After
>> saving the domain's registers, stacks are switched back to the normal
>> per physical cpu ones in order to be able to address on-stack data
>> from other cpus e.g. while handling IPIs.
>>
>> Adding %cr3 switching between saving of the registers and switching
>> the stacks will enable the possibility to run guest code without any
>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>> able to access other domains' data.
>>
>> Without any further measures it will still be possible for e.g. a
>> guest's user program to read stack data of another vcpu of the same
>> domain, but this can be easily avoided by a little PV-ABI modification
>> introducing per-cpu user address spaces.
>>
>> This series is meant as a replacement for Andrew's patch series:
>> "x86: Prerequisite work for a Xen KAISER solution".
>
> Considering in particular the two reverts, what I'm missing here
> is a clear description of the meaningful additional protection this
> approach provides over the band-aid. For context see also
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html
>  

 My approach supports mapping only the following data while the guest is
 running (apart from the guest's own data, of course):

 - the per-vcpu entry stacks of the domain which will contain only the
   guest's registers saved when an interrupt occurs
 - the per-vcpu GDTs and TSSs of the domain
 - the IDT
 - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)

 All other hypervisor data and code can be completely hidden from the
 guests.
>>>
>>> I understand that. What I'm not clear about is: Which parts of
>>> the additionally hidden data are actually necessary (or at least
>>> very desirable) to hide?
>>
>> Necessary:
>> - other guests' memory (e.g. physical memory 1:1 mapping)
>> - data from other guests, e.g. in stack pages, debug buffers, I/O buffers,
>>   code emulator buffers
>> - other guests' register values e.g. in vcpu structure
> 
> All of this is already being made invisible by the band-aid (with the
> exception of leftovers on the hypervisor stacks across context
> switches, which we've already said could be taken care of by
> memset()ing that area). I'm asking about the _additional_ benefits
> of your approach.

I'm quite sure the performance will be much better as it doesn't require
per physical cpu L4 page tables, but just a shadow L4 table for each
guest L4 table, similar to the Linux kernel KPTI approach.
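
Roughly speaking (a sketch of the idea only, not the actual patches; the
slot numbers and the xen_minimal_l4e() helper are made up for
illustration):

#include <stdint.h>

typedef uint64_t l4e_t;
#define L4_ENTRIES      512
#define FIRST_XEN_SLOT  256     /* assumed */
#define LAST_XEN_SLOT   271     /* assumed */

l4e_t xen_minimal_l4e(unsigned int slot);   /* hypothetical helper */

/* Keep a per-guest shadow L4 in sync with the guest's L4: copy the
 * guest-controlled slots verbatim, but fill the hypervisor slots with
 * only the minimal mappings (entry stacks, GDT/TSS, IDT, entry code). */
void sync_shadow_l4(l4e_t *shadow, const l4e_t *guest)
{
    for (unsigned int i = 0; i < L4_ENTRIES; i++)
        shadow[i] = (i >= FIRST_XEN_SLOT && i <= LAST_XEN_SLOT)
                  ? xen_minimal_l4e(i)
                  : guest[i];
}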

> 
>> Desirable: as much as possible. For instance, I don't buy your reasoning
>> regarding the Xen binary: how would you do this e.g. in a public cloud?
>> How do you know which Xen binary (possibly with livepatches) is being
>> used there? And today we don't have something like KASLR in Xen, but
>> not hiding the text and RO data would make introducing it quite
>> useless.
> 
> I'm aware that there are people thinking that .text and .rodata
> should be hidden; what I'm not really aware of is the reasoning
> behind that.

If an attacker knows of some vulnerability, it is just harder to use
that knowledge without knowing where the relevant data structures or code
live. It's like switching the lights off when you know somebody is
aiming a gun at you: the odds are much better if the shooter can't
see you.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Jan Beulich
>>> On 22.01.18 at 15:38,  wrote:
> On 22/01/18 15:22, Jan Beulich wrote:
> On 22.01.18 at 15:18,  wrote:
>>> On 22/01/18 13:50, Jan Beulich wrote:
>>> On 22.01.18 at 13:32,  wrote:
> As a preparation for doing page table isolation in the Xen hypervisor
> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
> 64 bit PV domains mapped to the per-domain virtual area.
>
> The per-vcpu stacks are used for early interrupt handling only. After
> saving the domain's registers, stacks are switched back to the normal
> per physical cpu ones in order to be able to address on-stack data
> from other cpus e.g. while handling IPIs.
>
> Adding %cr3 switching between saving of the registers and switching
> the stacks will enable the possibility to run guest code without any
> per physical cpu mapping, i.e. avoiding the threat of a guest being
> able to access other domains' data.
>
> Without any further measures it will still be possible for e.g. a
> guest's user program to read stack data of another vcpu of the same
> domain, but this can be easily avoided by a little PV-ABI modification
> introducing per-cpu user address spaces.
>
> This series is meant as a replacement for Andrew's patch series:
> "x86: Prerequisite work for a Xen KAISER solution".

 Considering in particular the two reverts, what I'm missing here
 is a clear description of the meaningful additional protection this
 approach provides over the band-aid. For context see also
 https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>
>>> My approach supports mapping only the following data while the guest is
>>> running (apart from the guest's own data, of course):
>>>
>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>   guest's registers saved when an interrupt occurs
>>> - the per-vcpu GDTs and TSSs of the domain
>>> - the IDT
>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)
>>>
>>> All other hypervisor data and code can be completely hidden from the
>>> guests.
>> 
>> I understand that. What I'm not clear about is: Which parts of
>> the additionally hidden data are actually necessary (or at least
>> very desirable) to hide?
> 
> Necessary:
> - other guests' memory (e.g. physical memory 1:1 mapping)
> - data from other guests, e.g. in stack pages, debug buffers, I/O buffers,
>   code emulator buffers
> - other guests' register values e.g. in vcpu structure

All of this is already being made invisible by the band-aid (with the
exception of leftovers on the hypervisor stacks across context
switches, which we've already said could be taken care of by
memset()ing that area). I'm asking about the _additional_ benefits
of your approach.
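
(For illustration, a minimal sketch of that scrubbing, assuming a flat
per-CPU stack area; stack_base() is a placeholder:)

#include <stdint.h>
#include <string.h>

extern void *stack_base(unsigned int cpu);  /* hypothetical accessor */

/* Zero the currently unused part of this CPU's hypervisor stack, i.e.
 * everything between the stack base and the live stack pointer, so that
 * nothing written by the previously scheduled vcpu survives. */
static void scrub_cpu_stack(unsigned int cpu, const void *sp)
{
    uint8_t *base = stack_base(cpu);

    memset(base, 0, (const uint8_t *)sp - base);
}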

> Desirable: as much as possible. For instance, I don't buy your reasoning
> regarding the Xen binary: how would you do this e.g. in a public cloud?
> How do you know which Xen binary (possibly with livepatches) is being
> used there? And today we don't have something like KASLR in Xen, but
> not hiding the text and RO data would make introducing it quite
> useless.

I'm aware that there are people thinking that .text and .rodata
should be hidden; what I'm not really aware of is the reasoning
behind that.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Juergen Gross
On 22/01/18 15:22, Jan Beulich wrote:
 On 22.01.18 at 15:18,  wrote:
>> On 22/01/18 13:50, Jan Beulich wrote:
>> On 22.01.18 at 13:32,  wrote:
 As a preparation for doing page table isolation in the Xen hypervisor
 in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
 64 bit PV domains mapped to the per-domain virtual area.

 The per-vcpu stacks are used for early interrupt handling only. After
 saving the domain's registers, stacks are switched back to the normal
 per physical cpu ones in order to be able to address on-stack data
 from other cpus e.g. while handling IPIs.

 Adding %cr3 switching between saving of the registers and switching
 the stacks will enable the possibility to run guest code without any
 per physical cpu mapping, i.e. avoiding the threat of a guest being
 able to access other domains' data.

 Without any further measures it will still be possible for e.g. a
 guest's user program to read stack data of another vcpu of the same
 domain, but this can be easily avoided by a little PV-ABI modification
 introducing per-cpu user address spaces.

 This series is meant as a replacement for Andrew's patch series:
 "x86: Prerequisite work for a Xen KAISER solution".
>>>
>>> Considering in particular the two reverts, what I'm missing here
>>> is a clear description of the meaningful additional protection this
>>> approach provides over the band-aid. For context see also
>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>
>> My approach supports mapping only the following data while the guest is
>> running (apart from the guest's own data, of course):
>>
>> - the per-vcpu entry stacks of the domain which will contain only the
>>   guest's registers saved when an interrupt occurs
>> - the per-vcpu GDTs and TSSs of the domain
>> - the IDT
>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)
>>
>> All other hypervisor data and code can be completely hidden from the
>> guests.
> 
> I understand that. What I'm not clear about is: Which parts of
> the additionally hidden data are actually necessary (or at least
> very desirable) to hide?

Necessary:
- other guests' memory (e.g. physical memory 1:1 mapping)
- data from other guests, e.g. in stack pages, debug buffers, I/O buffers,
  code emulator buffers
- other guests' register values e.g. in vcpu structure

Desirable: as much as possible. For instance, I don't buy your reasoning
regarding the Xen binary: how would you do this e.g. in a public cloud?
How do you know which Xen binary (possibly with livepatches) is being
used there? And today we don't have something like KASLR in Xen, but
not hiding the text and RO data would make introducing it quite
useless.


Juergen

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains

2018-01-22 Thread Jan Beulich
>>> On 22.01.18 at 13:32,  wrote:
> As a preparation for doing page table isolation in the Xen hypervisor
> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
> 64 bit PV domains mapped to the per-domain virtual area.
> 
> The per-vcpu stacks are used for early interrupt handling only. After
> saving the domain's registers, stacks are switched back to the normal
> per physical cpu ones in order to be able to address on-stack data
> from other cpus e.g. while handling IPIs.
> 
> Adding %cr3 switching between saving of the registers and switching
> the stacks will enable the possibility to run guest code without any
> per physical cpu mapping, i.e. avoiding the threat of a guest being
> able to access other domains' data.
> 
> Without any further measures it will still be possible for e.g. a
> guest's user program to read stack data of another vcpu of the same
> domain, but this can be easily avoided by a little PV-ABI modification
> introducing per-cpu user address spaces.
> 
> This series is meant as a replacement for Andrew's patch series:
> "x86: Prerequisite work for a Xen KAISER solution".

Considering in particular the two reverts, what I'm missing here
is a clear description of the meaningful additional protection this
approach provides over the band-aid. For context see also
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel