Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-12-12 Thread Jan Beulich
>>> On 11.12.18 at 19:05,  wrote:
> On Fri, Oct 26, 2018 at 05:20:47AM -0600, Jan Beulich wrote:
>> >>> On 26.10.18 at 12:51,  wrote:
>> > The basic solution involves having a xenheap virtual address mapping
>> > area not tied to the physical layout of the memory.  domheap and xenheap
>> > memory would have to come from the same pool, but xenheap would need to
>> > be mapped into the xenheap virtual memory region before being returned.
>> 
>> Wouldn't this most easily be done by making alloc_xenheap_pages()
>> call alloc_domheap_pages() and then vmap() the result? Of course
>> we may need to grow the vmap area in that case.
> 
> The existing vmap area is 64GB, but shouldn't that be big enough for Xen?

In the common case perhaps. But what about extreme cases, like
very many VMs on multi-Tb hosts?
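
For a sense of scale (the per-VM and per-host figures below are only
illustrative assumptions, not measurements):

    64GB of VA at 4kB per mapping            = ~16M page-sized mappings
    4,000 VMs x 4MB of vmap()ed xenheap each = ~16GB of VA (fits comfortably)
    bookkeeping at 0.1% of a 6Tb host's RAM  = ~6GB of VA (starts to matter)

i.e. the pressure comes less from the VM count itself than from any
allocations that scale with host memory.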

> If that's not big enough, we need to move that area to a different
> location, because it can't expand to either side of the address space.

When the directmap goes away, ample address space gets freed
up.

Jan




Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-12-11 Thread Wei Liu
On Fri, Oct 26, 2018 at 05:20:47AM -0600, Jan Beulich wrote:
> >>> On 26.10.18 at 12:51,  wrote:
> > On 10/26/2018 10:56 AM, Jan Beulich wrote:
> > On 26.10.18 at 11:28,  wrote:
> >>> On Fri, Oct 26, 2018 at 03:16:15AM -0600, Jan Beulich wrote:
> >>> On 25.10.18 at 18:29,  wrote:
> > A split xenheap model means that data pertaining to other guests isn't
> > mapped in the context of this vcpu, so cannot be brought into the cache.
> 
>  It was not clear to me from Wei's original mail that talk here is
>  about "split" in a sense of "per-domain"; I was assuming the
>  CONFIG_SEPARATE_XENHEAP mode instead.
> >>>
> >>> The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
> >>> what I wanted most is the partial direct map, which reduces the amount
> >>> of data mapped inside Xen's context -- the original idea was removing the
> >>> direct map, discussed during one of the calls IIRC. I thought making the
> >>> partial direct map mode work and making it as small as possible would get
> >>> us 90% there.
> >>>
> >>> The "per-domain" heap is a different work item.
> >> 
> >> But if we mean to go that route, going (back) to the separate
> >> Xen heap model seems just like an extra complication to me.
> >> Yet I agree that this would remove the need for a fair chunk of
> >> the direct map. Otoh a statically partitioned Xen heap would
> >> bring back scalability issues which we had specifically meant to
> >> get rid of by moving away from that model.
> > 
> > I think turning SEPARATE_XENHEAP back on would just be the first step.
> > We definitely would then need to sort things out so that it's scalable
> > again.
> > 
> > After system set-up, the key difference between xenheap and domheap
> > pages is that xenheap pages are assumed to be always mapped (i.e., you
> > can keep a pointer to them and it will be valid), whereas domheap pages
> > cannot be assumed to be mapped, and need to be wrapped with
> > [un]map_domain_page().
> > 
> > The basic solution involves having a xenheap virtual address mapping
> > area not tied to the physical layout of the memory.  domheap and xenheap
> > memory would have to come from the same pool, but xenheap would need to
> > be mapped into the xenheap virtual memory region before being returned.
> 
> Wouldn't this most easily be done by making alloc_xenheap_pages()
> call alloc_domheap_pages() and then vmap() the result? Of course
> we may need to grow the vmap area in that case.

The existing vmap area is 64GB, but shouldn't that be big enough for Xen?

If that's not big enough, we need to move that area to a different
location, because it can't expand to either side of the address space.

Wei.

> 
> Jan
> 
> 


Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-12-10 Thread George Dunlap
On 12/10/18 12:12 PM, George Dunlap wrote:
> On 12/7/18 6:40 PM, Wei Liu wrote:
>> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
>>> Hello,
>>>
>>> This is an accumulation and summary of various tasks which have been
>>> discussed since the revelation of the speculative security issues in
>>> January, and also an invitation to discuss alternative ideas.  They are
>>> x86 specific, but a lot of the principles are architecture-agnostic.
>>>
>>> 1) A secrets-free hypervisor.
>>>
>>> Basically every hypercall can be (ab)used by a guest, and used as an
>>> arbitrary cache-load gadget.  Logically, this is the first half of a
>>> Spectre SP1 gadget, and is usually the first stepping stone to
>>> exploiting one of the speculative sidechannels.
>>>
>>> Short of compiling Xen with LLVM's Speculative Load Hardening (which is
>>> still experimental, and comes with a ~30% perf hit in the common case),
>>> this is unavoidable.  Furthermore, throwing a few array_index_nospec()
>>> into the code isn't a viable solution to the problem.
>>>
>>> An alternative option is to have less data mapped into Xen's virtual
>>> address space - if a piece of memory isn't mapped, it can't be loaded
>>> into the cache.
>>>
>>> An easy first step here is to remove Xen's directmap, which will mean
>>> that guests' general RAM isn't mapped by default into Xen's address
>>> space.  This will come with some performance hit, as the
>>> map_domain_page() infrastructure will now have to actually
>>> create/destroy mappings, but removing the directmap will cause an
>>> improvement for non-speculative security as well (No possibility of
>>> ret2dir as an exploit technique).
>>>
>>> Beyond the directmap, there are plenty of other interesting secrets in
>>> the Xen heap and other mappings, such as the stacks of the other pcpus. 
>>> Fixing this requires moving Xen to having a non-uniform memory layout,
>>> and this is much harder to change.  I already experimented with this as
>>> a meltdown mitigation around about a year ago, and posted the resulting
>>> series on Jan 4th,
>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
>>> some trivial bits of which have already found their way upstream.
>>>
>>> To have a non-uniform memory layout, Xen may not share L4 pagetables. 
>>> i.e. Xen must never have two pcpus which reference the same pagetable in
>>> %cr3.
>>>
>>> This property already holds for 32bit PV guests, and all HVM guests, but
>>> 64bit PV guests are the sticking point.  Because Linux has a flat memory
>>> layout, when a 64bit PV guest schedules two threads from the same
>>> process on separate vcpus, those two vcpus have the same virtual %cr3,
>>> and currently, Xen programs the same real %cr3 into hardware.
>>>
>>> If we want Xen to have a non-uniform layout, the two options are:
>>> * Fix Linux to have the same non-uniform layout that Xen wants
>>> (Backwards compatibility for older 64bit PV guests can be achieved with
>>> xen-shim).
>>> * Make use of the XPTI algorithm (specifically, the pagetable sync/copy part)
>>> forever more in the future.
>>>
>>> Option 2 isn't great (especially for perf on fixed hardware), but does
>>> keep all the necessary changes in Xen.  Option 1 looks to be the better
>>> option longterm.
>>>
>>> As an interesting point to note.  The 32bit PV ABI prohibits sharing of
>>> L3 pagetables, because back in the 32bit hypervisor days, we used to
>>> have linear mappings in the Xen virtual range.  This check is stale
>>> (from a functionality point of view), but still present in Xen.  A
>>> consequence of this is that 32bit PV guests definitely don't share
>>> top-level pagetables across vcpus.
>>
>> Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but L3
>> pagetables can be shared. So guests will schedule the same top-level
>> pagetables across vcpus.
>>
>> But 64bit Xen creates a monitor table for a 32bit PAE guest and puts the
>> CR3 provided by the guest into the first slot, so pcpus don't share the same
>> L4 pagetables. The property we want still holds.
> 
> Ah, right -- but Xen can get away with this because in PAE mode, "L3" is
> just 4 entries that are loaded on CR3-switch and not automatically kept
> in sync by the hardware; i.e., the OS already needs to do its own
> "manual syncing" if it updates any of the L3 entries; so it's the same
> for Xen.
> 
>>> Juergen/Boris: Do you have any idea if/how easy this infrastructure
>>> would be to implement for 64bit PV guests as well?  If a PV guest can
>>> advertise via Elfnote that it won't share top-level pagetables, then we
>>> can audit this trivially in Xen.
>>>
>>
>> After reading Linux kernel code, I think it is not going to be trivial.
>> As of now, threads in Linux share one pagetable (as they should).
>>
>> In order to make each thread have its own pagetable while still maintaining
>> the illusion of one address space, there needs to be synchronisation
>> under the hood.
>>
>> There is code in Linux to synchronise vmalloc, but that's only for the
>> kernel portion.

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-12-10 Thread George Dunlap
On 12/7/18 6:40 PM, Wei Liu wrote:
> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
>> Hello,
>>
>> This is an accumulation and summary of various tasks which have been
>> discussed since the revelation of the speculative security issues in
>> January, and also an invitation to discuss alternative ideas.  They are
>> x86 specific, but a lot of the principles are architecture-agnostic.
>>
>> 1) A secrets-free hypervisor.
>>
>> Basically every hypercall can be (ab)used by a guest, and used as an
>> arbitrary cache-load gadget.  Logically, this is the first half of a
>> Spectre SP1 gadget, and is usually the first stepping stone to
>> exploiting one of the speculative sidechannels.
>>
>> Short of compiling Xen with LLVM's Speculative Load Hardening (which is
>> still experimental, and comes with a ~30% perf hit in the common case),
>> this is unavoidable.  Furthermore, throwing a few array_index_nospec()
>> into the code isn't a viable solution to the problem.
>>
>> An alternative option is to have less data mapped into Xen's virtual
>> address space - if a piece of memory isn't mapped, it can't be loaded
>> into the cache.
>>
>> An easy first step here is to remove Xen's directmap, which will mean
>> that guests' general RAM isn't mapped by default into Xen's address
>> space.  This will come with some performance hit, as the
>> map_domain_page() infrastructure will now have to actually
>> create/destroy mappings, but removing the directmap will cause an
>> improvement for non-speculative security as well (No possibility of
>> ret2dir as an exploit technique).
>>
>> Beyond the directmap, there are plenty of other interesting secrets in
>> the Xen heap and other mappings, such as the stacks of the other pcpus. 
>> Fixing this requires moving Xen to having a non-uniform memory layout,
>> and this is much harder to change.  I already experimented with this as
>> a meltdown mitigation around about a year ago, and posted the resulting
>> series on Jan 4th,
>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
>> some trivial bits of which have already found their way upstream.
>>
>> To have a non-uniform memory layout, Xen may not share L4 pagetables. 
>> i.e. Xen must never have two pcpus which reference the same pagetable in
>> %cr3.
>>
>> This property already holds for 32bit PV guests, and all HVM guests, but
>> 64bit PV guests are the sticking point.  Because Linux has a flat memory
>> layout, when a 64bit PV guest schedules two threads from the same
>> process on separate vcpus, those two vcpus have the same virtual %cr3,
>> and currently, Xen programs the same real %cr3 into hardware.
>>
>> If we want Xen to have a non-uniform layout, the two options are:
>> * Fix Linux to have the same non-uniform layout that Xen wants
>> (Backwards compatibility for older 64bit PV guests can be achieved with
>> xen-shim).
>> * Make use of the XPTI algorithm (specifically, the pagetable sync/copy part)
>> forever more in the future.
>>
>> Option 2 isn't great (especially for perf on fixed hardware), but does
>> keep all the necessary changes in Xen.  Option 1 looks to be the better
>> option longterm.
>>
>> As an interesting point to note.  The 32bit PV ABI prohibits sharing of
>> L3 pagetables, because back in the 32bit hypervisor days, we used to
>> have linear mappings in the Xen virtual range.  This check is stale
>> (from a functionality point of view), but still present in Xen.  A
>> consequence of this is that 32bit PV guests definitely don't share
>> top-level pagetables across vcpus.
> 
> Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but L3
> pagetables can be shared. So guests will schedule the same top-level
> pagetables across vcpus.
>
> But 64bit Xen creates a monitor table for a 32bit PAE guest and puts the
> CR3 provided by the guest into the first slot, so pcpus don't share the same
> L4 pagetables. The property we want still holds.

Ah, right -- but Xen can get away with this because in PAE mode, "L3" is
just 4 entries that are loaded on CR3-switch and not automatically kept
in sync by the hardware; i.e., the OS already needs to do its own
"manual syncing" if it updates any of the L3 entries; so it's the same
for Xen.
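
A minimal sketch of that manual syncing, with the cr3 accessors assumed
rather than taken from any particular tree:

    #include <stdint.h>

    /* Assumed helpers in the style of the usual cr3 accessors. */
    uint64_t read_cr3(void);
    void write_cr3(uint64_t cr3);

    /*
     * On PAE, the four PDPTEs are read by the processor when CR3 is loaded
     * and are not kept coherent with memory afterwards, so after updating
     * one in memory the OS reloads CR3 itself to make the change visible.
     */
    static void set_pae_l3_entry(uint64_t *l3, unsigned int slot, uint64_t e)
    {
        l3[slot] = e;            /* update the in-memory copy ...        */
        write_cr3(read_cr3());   /* ... then force the PDPTEs to reload  */
    }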

>> Juergen/Boris: Do you have any idea if/how easy this infrastructure
>> would be to implement for 64bit PV guests as well?  If a PV guest can
>> advertise via Elfnote that it won't share top-level pagetables, then we
>> can audit this trivially in Xen.
>>
> 
> After reading Linux kernel code, I think it is not going to be trivial.
> As of now, threads in Linux share one pagetable (as they should).
> 
> In order to make each thread have its own pagetable while still maintaining
> the illusion of one address space, there needs to be synchronisation
> under the hood.
> 
> There is code in Linux to synchronise vmalloc, but that's only for the
> kernel portion. The infrastructure to synchronise the userspace portion is
> missing.
> 
> One idea is to follow the same model as vmalloc -- maintain a reference
> pagetable in struct mm and a list of pagetables for threads, then
> synchronise the pagetables in the page fault handler.

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-12-07 Thread Wei Liu
On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
> Hello,
> 
> This is an accumulation and summary of various tasks which have been
> discussed since the revelation of the speculative security issues in
> January, and also an invitation to discuss alternative ideas.  They are
> x86 specific, but a lot of the principles are architecture-agnostic.
> 
> 1) A secrets-free hypervisor.
> 
> Basically every hypercall can be (ab)used by a guest, and used as an
> arbitrary cache-load gadget.  Logically, this is the first half of a
> Spectre SP1 gadget, and is usually the first stepping stone to
> exploiting one of the speculative sidechannels.
> 
> Short of compiling Xen with LLVM's Speculative Load Hardening (which is
> still experimental, and comes with a ~30% perf hit in the common case),
> this is unavoidable.  Furthermore, throwing a few array_index_nospec()
> into the code isn't a viable solution to the problem.
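
For reference, the pattern being talked about looks roughly like this (the
hypercall, table and bound are invented for the example; only
array_index_nospec() itself is the real interface):

    #include <xen/errno.h>
    #include <xen/nospec.h>

    static long example_table[8];         /* made-up secret-adjacent data */
    static unsigned int nr_entries = 8;

    long do_example_op(unsigned int idx)
    {
        if ( idx >= nr_entries )
            return -EINVAL;

        /*
         * The CPU can speculate past the bounds check above, so without
         * the clamp below example_table[idx] can act as the cache-load
         * gadget.
         */
        idx = array_index_nospec(idx, nr_entries);

        return example_table[idx];
    }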
> 
> An alternative option is to have less data mapped into Xen's virtual
> address space - if a piece of memory isn't mapped, it can't be loaded
> into the cache.
> 
> An easy first step here is to remove Xen's directmap, which will mean
> that guests' general RAM isn't mapped by default into Xen's address
> space.  This will come with some performance hit, as the
> map_domain_page() infrastructure will now have to actually
> create/destroy mappings, but removing the directmap will cause an
> improvement for non-speculative security as well (No possibility of
> ret2dir as an exploit technique).
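
Concretely, with the directmap gone every touch of guest memory has to
follow the map/unmap discipline; a sketch (the helper is invented,
map_domain_page()/unmap_domain_page() are the existing interfaces):

    #include <xen/domain_page.h>
    #include <xen/mm.h>
    #include <xen/string.h>

    /*
     * Illustrative helper: copy bytes out of a guest frame without a
     * directmap.  The map/unmap pair now really creates and destroys a
     * mapping, which is where the performance hit mentioned above comes in.
     */
    static void copy_from_frame(void *dst, mfn_t mfn, unsigned int offset,
                                unsigned int len)
    {
        const void *va = map_domain_page(mfn);   /* transient mapping */

        memcpy(dst, va + offset, len);
        unmap_domain_page(va);                   /* torn down again   */
    }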
> 
> Beyond the directmap, there are plenty of other interesting secrets in
> the Xen heap and other mappings, such as the stacks of the other pcpus. 
> Fixing this requires moving Xen to having a non-uniform memory layout,
> and this is much harder to change.  I already experimented with this as
> a meltdown mitigation around about a year ago, and posted the resulting
> series on Jan 4th,
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
> some trivial bits of which have already found their way upstream.
> 
> To have a non-uniform memory layout, Xen may not share L4 pagetables. 
> i.e. Xen must never have two pcpus which reference the same pagetable in
> %cr3.
> 
> This property already holds for 32bit PV guests, and all HVM guests, but
> 64bit PV guests are the sticking point.  Because Linux has a flat memory
> layout, when a 64bit PV guest schedules two threads from the same
> process on separate vcpus, those two vcpus have the same virtual %cr3,
> and currently, Xen programs the same real %cr3 into hardware.
> 
> If we want Xen to have a non-uniform layout, the two options are:
> * Fix Linux to have the same non-uniform layout that Xen wants
> (Backwards compatibility for older 64bit PV guests can be achieved with
> xen-shim).
> * Make use of the XPTI algorithm (specifically, the pagetable sync/copy part)
> forever more in the future.
> 
> Option 2 isn't great (especially for perf on fixed hardware), but does
> keep all the necessary changes in Xen.  Option 1 looks to be the better
> option longterm.
> 
> As an interesting point to note.  The 32bit PV ABI prohibits sharing of
> L3 pagetables, because back in the 32bit hypervisor days, we used to
> have linear mappings in the Xen virtual range.  This check is stale
> (from a functionality point of view), but still present in Xen.  A
> consequence of this is that 32bit PV guests definitely don't share
> top-level pagetables across vcpus.

Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but L3
pagetables can be shared. So guests will schedule the same top-level
pagetables across vcpus.

But 64bit Xen creates a monitor table for a 32bit PAE guest and puts the
CR3 provided by the guest into the first slot, so pcpus don't share the same
L4 pagetables. The property we want still holds.

> 
> Juergen/Boris: Do you have any idea if/how easy this infrastructure
> would be to implement for 64bit PV guests as well?  If a PV guest can
> advertise via Elfnote that it won't share top-level pagetables, then we
> can audit this trivially in Xen.
> 

After reading Linux kernel code, I think it is not going to be trivial.
As of now, threads in Linux share one pagetable (as they should).

In order to make each thread have its own pagetable while still maintaining
the illusion of one address space, there needs to be synchronisation
under the hood.

There is code in Linux to synchronise vmalloc, but that's only for the
kernel portion. The infrastructure to synchronise the userspace portion is
missing.

One idea is to follow the same model as vmalloc -- maintain a reference
pagetable in struct mm and a list of pagetables for threads, then
synchronise the pagetables in the page fault handler. But this is
probably a bit hard to sell to Linux maintainers because it will touch a
lot of the non-Xen code, increase complexity and decrease performance.
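
To make the shape of that concrete, a very rough sketch -- none of these
fields or helpers exist today, and the real thing would need locking and
TLB handling:

    #include <linux/errno.h>
    #include <linux/list.h>
    #include <linux/mm.h>
    #include <asm/pgtable.h>

    /* Hypothetical per-mm state: a canonical top level plus one private
     * top level per thread. */
    struct mm_pgd_ext {
        pgd_t *reference_pgd;           /* updated by the normal MM code  */
        struct list_head thread_pgds;   /* private per-thread top levels  */
    };

    /*
     * Called from the fault handler when a thread faults on an address
     * that is populated in the reference top level but not in its private
     * copy, in the same way x86's vmalloc_fault() repairs the kernel range.
     */
    static int sync_thread_pgd(struct mm_pgd_ext *ext, pgd_t *thread_pgd,
                               unsigned long addr)
    {
        pgd_t *ref = ext->reference_pgd + pgd_index(addr);
        pgd_t *cur = thread_pgd + pgd_index(addr);

        if (pgd_none(*ref))
            return -EFAULT;             /* a real fault, not a sync miss  */

        set_pgd(cur, *ref);             /* copy the top-level entry over  */
        return 0;
    }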

Thoughts?

Wei.


Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Dario Faggioli
On Fri, 2018-10-26 at 06:01 -0600, Tamas K Lengyel wrote:
> On Fri, Oct 26, 2018, 1:49 AM Dario Faggioli 
> wrote:
> > 
> > I haven't done this kind of benchmark yet, but I'd say that, if every
> > vCPU of every domain is doing 100% CPU intensive work, core-scheduling
> > isn't going to make much difference, or help you much, as compared to
> > regular scheduling with hyperthreading enabled.
> 
> Understood, we actually went into this with the assumption that
> in such cases core-scheduling would underperform plain credit1. 
>
Which may actually happen. Or it might improve things a little, because
there are higher chances that a core only has 1 thread busy. But then
we're not really benchmarking core-scheduling vs. plain-scheduling,
we're benchmarking a side-effect of core-scheduling, which is not
equally interesting.

> The idea was to measure the worst case with plain scheduling and with
> core-scheduling to be able to see the difference clearly between the
> two.
> 
For the sake of benchmarking core-scheduling solutions, we should put
ourselves in a position where what we measure is actually its own impact,
and I don't think this particular workload puts us there.

Then, of course, if this workload is relevant to you, you indeed have
the right to, and should, benchmark and evaluate it, and we're always
interested in hearing what you find out. :-)

> > Actual numbers may vary depending on whether VMs have odd or even
> > number of vCPUs but, e.g., on hardware with 2 threads per core, and
> > using VMs with at least 2 vCPUs each, the _perfect_ implementation of
> > core-scheduling would still manage to keep all the *threads* busy,
> > which is --as far as our speculations currently go-- what is causing
> > the performance degradation you're seeing.
> > 
> > So, again, if it is confirmed that this workload of yours is a
> > particularly bad one for SMT, then you are just better off disabling
> > hyperthreading. And, no, I don't think such a situation is common
> > enough to say "let's disable for everyone by default".
> 
> I wasn't asking to make it the default in Xen but if we make it the
> default for our deployment where such workloads are entirely
> possible, would that be reasonable. 
>
It all comes down to how common a situation is where you have a massively
oversubscribed system, with a fully CPU-bound workload, for significant
chunks of time.

As said in a previous email, I think that, if this is common enough,
and it is not just something transient, you're in trouble anyway.
And if it's not causing you/your customers troubles already, it might
not be that common, and hence it wouldn't be necessary/wise to disable
SMT.

But of course, you know your workload, and your requirements, much more
than me. If this kind of load really is what you experience, or what
you want to target, then yes, apparently disabling SMT is your best way
to go.

> If there are
> tests that I can run which are the "best case" for hyperthreading, I
> would like to repeat those tests to see where we are.
> 
If we come up with a good enough synthetic benchmark, I'll let you
know.

Regards,
Dario
-- 
<> (Raistlin Majere)
-
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/



Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Tamas K Lengyel
On Fri, Oct 26, 2018, 1:49 AM Dario Faggioli  wrote:

> On Thu, 2018-10-25 at 12:35 -0600, Tamas K Lengyel wrote:
> > On Thu, Oct 25, 2018 at 12:13 PM Andrew Cooper
> >  wrote:
> > >
> > > TBH, I'd perhaps start with an admin control which lets them switch
> > > between the two modes, and some instructions on how/why they might want
> > > to try switching.
> > >
> > > Trying to second-guess the best HT setting automatically is most likely
> > > going to be a lost cause.  It will be system specific as to whether the
> > > same workload is better with or without HT.
> >
> > This may just not be practically possible at the end as the system
> > administrator may have no idea what workload will be running on any
> > given system. It may also vary between one user to the next on the
> > same system, without the users being allowed to tune such details of
> > the system. If we can show that with core-scheduling deployed for
> > most
> > workloads performance is improved by x % it may be a safe option.
> >
> I haven't done this kind of benchmark yet, but I'd say that, if every
> vCPU of every domain is doing 100% CPU intensive work, core-scheduling
> isn't going to make much difference, or help you much, as compared to
> regular scheduling with hyperthreading enabled.
>

Understood, we actually went into this with the assumption that in such
cases core-scheduling would underperform plain credit1. The idea was to
measure the worst case with plain scheduling and with core-scheduling to be
able to see the difference clearly between the two.


> Actual numbers may vary depending on whether VMs have odd or even
> number of vCPUs but, e.g., on hardware with 2 threads per core, and
> using VMs with at least 2 vCPUs each, the _perfect_ implementation of
> core-scheduling would still manage to keep all the *threads* busy,
> which is --as far as our speculations currently go-- what is causing
> the performance degradation you're seeing.
>
> So, again, if it is confirmed that this workload of yours is a
> particularly bad one for SMT, then you are just better off disabling
> hyperthreading. And, no, I don't think such a situation is common
> enough to say "let's disable for everyone by default".
>

I wasn't asking to make it the default in Xen, but whether making it the
default for our deployment, where such workloads are entirely possible, would
be reasonable. Again, we don't know the workload and we can't predict it.
We were hoping to use core-scheduling eventually but it was not expected
that hyperthreading can cause such drops in performance. If there are tests
that I can run which are the "best case" for hyperthreading, I would like
to repeat those tests to see where we are.

Thanks,
Tamas

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Jan Beulich
>>> On 26.10.18 at 13:43,  wrote:
> On 10/26/2018 12:33 PM, Jan Beulich wrote:
> On 26.10.18 at 13:24,  wrote:
>>> On 10/26/2018 12:20 PM, Jan Beulich wrote:
>>> On 26.10.18 at 12:51,  wrote:
> The basic solution involves having a xenheap virtual address mapping
> area not tied to the physical layout of the memory.  domheap and xenheap
> memory would have to come from the same pool, but xenheap would need to
> be mapped into the xenheap virtual memory region before being returned.

 Wouldn't this most easily be done by making alloc_xenheap_pages()
 call alloc_domheap_pages() and then vmap() the result? Of course
 we may need to grow the vmap area in that case.
>>>
>>> I couldn't answer that question without a lot more digging. :-)  I'd
>>> always assumed that the reason for the original reason for having the
>>> xenheap direct-mapped on 32-bit was something to do with early-boot
>>> allocation; if there is something tricky there, we'd need to
>>> special-case the early-boot allocation somehow.
>> 
>> The reason for the split on 32-bit was simply the lack of sufficient
>> VA space.
> 
> That tells me why the domheap was *not* direct-mapped; but it doesn't
> tell me why the xenheap *was*.  Was it perhaps just something that
> evolved from what we inherited from Linux?

Presumably, but there I'm really the wrong one to ask. When I joined,
things had long been that way.

Jan




Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread George Dunlap
On 10/26/2018 12:33 PM, Jan Beulich wrote:
 On 26.10.18 at 13:24,  wrote:
>> On 10/26/2018 12:20 PM, Jan Beulich wrote:
>> On 26.10.18 at 12:51,  wrote:
 The basic solution involves having a xenheap virtual address mapping
 area not tied to the physical layout of the memory.  domheap and xenheap
 memory would have to come from the same pool, but xenheap would need to
 be mapped into the xenheap virtual memory region before being returned.
>>>
>>> Wouldn't this most easily be done by making alloc_xenheap_pages()
>>> call alloc_domheap_pages() and then vmap() the result? Of course
>>> we may need to grow the vmap area in that case.
>>
>> I couldn't answer that question without a lot more digging. :-)  I'd
>> always assumed that the original reason for having the
>> xenheap direct-mapped on 32-bit was something to do with early-boot
>> allocation; if there is something tricky there, we'd need to
>> special-case the early-boot allocation somehow.
> 
> The reason for the split on 32-bit was simply the lack of sufficient
> VA space.

That tells me why the domheap was *not* direct-mapped; but it doesn't
tell me why the xenheap *was*.  Was it perhaps just something that
evolved from what we inherited from Linux?

 -George


Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Jan Beulich
>>> On 26.10.18 at 13:24,  wrote:
> On 10/26/2018 12:20 PM, Jan Beulich wrote:
> On 26.10.18 at 12:51,  wrote:
>>> The basic solution involves having a xenheap virtual address mapping
>>> area not tied to the physical layout of the memory.  domheap and xenheap
>>> memory would have to come from the same pool, but xenheap would need to
>>> be mapped into the xenheap virtual memory region before being returned.
>> 
>> Wouldn't this most easily be done by making alloc_xenheap_pages()
>> call alloc_domheap_pages() and then vmap() the result? Of course
>> we may need to grow the vmap area in that case.
> 
> I couldn't answer that question without a lot more digging. :-)  I'd
> always assumed that the original reason for having the
> xenheap direct-mapped on 32-bit was something to do with early-boot
> allocation; if there is something tricky there, we'd need to
> special-case the early-boot allocation somehow.

The reason for the split on 32-bit was simply the lack of sufficient
VA space.

Jan




Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread George Dunlap
On 10/26/2018 12:20 PM, Jan Beulich wrote:
 On 26.10.18 at 12:51,  wrote:
>> On 10/26/2018 10:56 AM, Jan Beulich wrote:
>> On 26.10.18 at 11:28,  wrote:
 On Fri, Oct 26, 2018 at 03:16:15AM -0600, Jan Beulich wrote:
 On 25.10.18 at 18:29,  wrote:
>> A split xenheap model means that data pertaining to other guests isn't
>> mapped in the context of this vcpu, so cannot be brought into the cache.
>
> It was not clear to me from Wei's original mail that talk here is
> about "split" in a sense of "per-domain"; I was assuming the
> CONFIG_SEPARATE_XENHEAP mode instead.

 The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
 what I wanted most is the partial direct map, which reduces the amount
 of data mapped inside Xen's context -- the original idea was removing the
 direct map, discussed during one of the calls IIRC. I thought making the
 partial direct map mode work and making it as small as possible would get
 us 90% there.

 The "per-domain" heap is a different work item.
>>>
>>> But if we mean to go that route, going (back) to the separate
>>> Xen heap model seems just like an extra complication to me.
>>> Yet I agree that this would remove the need for a fair chunk of
>>> the direct map. Otoh a statically partitioned Xen heap would
>>> bring back scalability issues which we had specifically meant to
>>> get rid of by moving away from that model.
>>
>> I think turning SEPARATE_XENHEAP back on would just be the first step.
>> We definitely would then need to sort things out so that it's scalable
>> again.
>>
>> After system set-up, the key difference between xenheap and domheap
>> pages is that xenheap pages are assumed to be always mapped (i.e., you
>> can keep a pointer to them and it will be valid), whereas domheap pages
>> cannot be assumed to be mapped, and need to be wrapped with
>> [un]map_domain_page().
>>
>> The basic solution involves having a xenheap virtual address mapping
>> area not tied to the physical layout of the memory.  domheap and xenheap
>> memory would have to come from the same pool, but xenheap would need to
>> be mapped into the xenheap virtual memory region before being returned.
> 
> Wouldn't this most easily be done by making alloc_xenheap_pages()
> call alloc_domheap_pages() and then vmap() the result? Of course
> we may need to grow the vmap area in that case.

I couldn't answer that question without a lot more digging. :-)  I'd
always assumed that the reason for the original reason for having the
xenheap direct-mapped on 32-bit was something to do with early-boot
allocation; if there is something tricky there, we'd need to
special-case the early-boot allocation somehow.

 -George


Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Jan Beulich
>>> On 26.10.18 at 12:51,  wrote:
> On 10/26/2018 10:56 AM, Jan Beulich wrote:
> On 26.10.18 at 11:28,  wrote:
>>> On Fri, Oct 26, 2018 at 03:16:15AM -0600, Jan Beulich wrote:
>>> On 25.10.18 at 18:29,  wrote:
> A split xenheap model means that data pertaining to other guests isn't
> mapped in the context of this vcpu, so cannot be brought into the cache.

 It was not clear to me from Wei's original mail that talk here is
 about "split" in a sense of "per-domain"; I was assuming the
 CONFIG_SEPARATE_XENHEAP mode instead.
>>>
>>> The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
>>> what I wanted most is the partial direct map, which reduces the amount
>>> of data mapped inside Xen's context -- the original idea was removing the
>>> direct map, discussed during one of the calls IIRC. I thought making the
>>> partial direct map mode work and making it as small as possible would get
>>> us 90% there.
>>>
>>> The "per-domain" heap is a different work item.
>> 
>> But if we mean to go that route, going (back) to the separate
>> Xen heap model seems just like an extra complication to me.
>> Yet I agree that this would remove the need for a fair chunk of
>> the direct map. Otoh a statically partitioned Xen heap would
>> bring back scalability issues which we had specifically meant to
>> get rid of by moving away from that model.
> 
> I think turning SEPARATE_XENHEAP back on would just be the first step.
> We definitely would then need to sort things out so that it's scalable
> again.
> 
> After system set-up, the key difference between xenheap and domheap
> pages is that xenheap pages are assumed to be always mapped (i.e., you
> can keep a pointer to them and it will be valid), whereas domheap pages
> cannot be assumed to be mapped, and need to be wrapped with
> [un]map_domain_page().
> 
> The basic solution involves having a xenheap virtual address mapping
> area not tied to the physical layout of the memory.  domheap and xenheap
> memory would have to come from the same pool, but xenheap would need to
> be mapped into the xenheap virtual memory region before being returned.

Wouldn't this most easily be done by making alloc_xenheap_pages()
call alloc_domheap_pages() and then vmap() the result? Of course
we may need to grow the vmap area in that case.
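
Roughly along the lines of the following sketch (order 0 only; a real
version would also handle higher orders -- vmap() wants one MFN per page --
plus memflags and the early-boot window before the vmap area exists):

    #include <xen/mm.h>
    #include <xen/vmap.h>

    /* Sketch only -- not a drop-in replacement for alloc_xenheap_pages(). */
    static void *alloc_xenheap_page_vmapped(void)
    {
        struct page_info *pg = alloc_domheap_pages(NULL, 0, 0);
        mfn_t mfn;
        void *va;

        if ( pg == NULL )
            return NULL;

        mfn = page_to_mfn(pg);
        va = vmap(&mfn, 1);               /* permanent VA in the vmap area */
        if ( va == NULL )
            free_domheap_pages(pg, 0);

        return va;
    }

The free path would then need to vunmap() the VA and hand the page back,
and anything that currently converts between a xenheap VA and an MFN via
directmap arithmetic would need a vmap-aware equivalent.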

Jan




Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread George Dunlap
On 10/26/2018 10:56 AM, Jan Beulich wrote:
 On 26.10.18 at 11:28,  wrote:
>> On Fri, Oct 26, 2018 at 03:16:15AM -0600, Jan Beulich wrote:
>> On 25.10.18 at 18:29,  wrote:
 A split xenheap model means that data pertaining to other guests isn't
 mapped in the context of this vcpu, so cannot be brought into the cache.
>>>
>>> It was not clear to me from Wei's original mail that talk here is
>>> about "split" in a sense of "per-domain"; I was assuming the
>>> CONFIG_SEPARATE_XENHEAP mode instead.
>>
>> The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
>> what I wanted most is the partial direct map, which reduces the amount
>> of data mapped inside Xen's context -- the original idea was removing the
>> direct map, discussed during one of the calls IIRC. I thought making the
>> partial direct map mode work and making it as small as possible would get
>> us 90% there.
>>
>> The "per-domain" heap is a different work item.
> 
> But if we mean to go that route, going (back) to the separate
> Xen heap model seems just like an extra complication to me.
> Yet I agree that this would remove the need for a fair chunk of
> the direct map. Otoh a statically partitioned Xen heap would
> bring back scalability issues which we had specifically meant to
> get rid of by moving away from that model.

I think turning SEPARATE_XENHEAP back on would just be the first step.
We definitely would then need to sort things out so that it's scalable
again.

After system set-up, the key difference between xenheap and domheap
pages is that xenheap pages are assumed to be always mapped (i.e., you
can keep a pointer to them and it will be valid), whereas domheap pages
cannot be assumed to be mapped, and need to be wrapped with
[un]map_domain_page().

The basic solution involves having a xenheap virtual address mapping
area not tied to the physical layout of the memory.  domheap and xenheap
memory would have to come from the same pool, but xenheap would need to
be mapped into the xenheap virtual memory region before being returned.
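
As a minimal illustration of that difference in lifetime rules (struct foo,
the error handling and the surrounding code are invented/omitted):

    #include <xen/domain_page.h>
    #include <xen/mm.h>
    #include <xen/sched.h>

    struct foo { unsigned long counter; };    /* invented payload */

    static void lifetime_example(struct domain *d)
    {
        struct foo *f, *tmp;
        struct page_info *pg;

        /* xenheap: the pointer stays valid for the allocation's lifetime. */
        f = alloc_xenheap_page();
        f->counter++;                         /* usable at any later point */

        /* domheap: no mapping is guaranteed, so every access is bracketed. */
        pg = alloc_domheap_pages(d, 0, 0);
        tmp = map_domain_page(page_to_mfn(pg));
        tmp->counter++;
        unmap_domain_page(tmp);
    }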

 -George


Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread George Dunlap
On 10/25/2018 07:13 PM, Andrew Cooper wrote:
> On 25/10/18 18:58, Tamas K Lengyel wrote:
>> On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
>>  wrote:
>>> On 25/10/18 18:35, Tamas K Lengyel wrote:
 On Thu, Oct 25, 2018 at 11:02 AM George Dunlap  
 wrote:
> On 10/25/2018 05:55 PM, Andrew Cooper wrote:
>> On 24/10/18 16:24, Tamas K Lengyel wrote:
 A solution to this issue was proposed, whereby Xen synchronises siblings
 on vmexit/entry, so we are never executing code in two different
 privilege levels.  Getting this working would make it safe to continue
 using hyperthreading even in the presence of L1TF.  Obviously, it's going
 to come with a perf hit, but compared to disabling hyperthreading, all it's
 got to do is beat a 60% perf hit to make it the preferable option for
 making your system L1TF-proof.
>>> Could you shed some light on what tests were done where that 60%
>>> performance hit was observed? We have performed intensive stress-tests
>>> to confirm this but according to our findings turning off
>>> hyper-threading is actually improving performance on all machines we
>>> tested thus far.
>> Aggregate inter and intra host disk and network throughput, which is a
>> reasonable approximation of a load of webserver VM's on a single
>> physical server.  Small packet IO was hit worst, as it has a very high
>> vcpu context switch rate between dom0 and domU.  Disabling HT means you
>> have half the number of logical cores to schedule on, which doubles the
>> mean time to next timeslice.
>>
>> In principle, for a fully optimised workload, HT gets you ~30% extra due
>> to increased utilisation of the pipeline functional units.  Some
>> resources are statically partitioned, while some are competitively
>> shared, and it's now been well proven that actions on one thread can have
>> a large effect on others.
>>
>> Two arbitrary vcpus are not an optimised workload.  If the perf
>> improvement you get from not competing in the pipeline is greater than
>> the perf loss from Xen's reduced capability to schedule, then disabling
>> HT would be an improvement.  I can certainly believe that this might be
>> the case for Qubes style workloads where you are probably not very
>> overprovisioned, and you probably don't have long running IO and CPU
>> bound tasks in the VMs.
> As another data point, I think it was MSCI who said they always disabled
> hyperthreading, because they also found that their workloads ran slower
> with HT than without.  Presumably they were doing massive number
> crunching, such that each thread was waiting on the ALU a significant
> portion of the time anyway; at which point the superscalar scheduling
> and/or reduction in cache efficiency would have brought performance from
> "no benefit" down to "negative benefit".
>
 Thanks for the insights. Indeed, we are primarily concerned with
 performance of Qubes-style workloads which may range from
 no-oversubscription to heavily oversubscribed. It's not a workload we
 can predict or optimize before-hand, so we are looking for a default
 that would be 1) safe and 2) performant in the most general case
 possible.
>>> So long as you've got the XSA-273 patches, you should be able to park
>>> and re-reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.
>>>
>>> You should be able to effectively change hyperthreading configuration at
>>> runtime.  It's not quite the same as changing it in the BIOS, but from a
>>> competition of pipeline resources, it should be good enough.
>>>
>> Thanks, indeed that is a handy tool to have. We often can't disable
>> hyperthreading in the BIOS anyway because most BIOS' don't allow you
>> to do that when TXT is used.
> 
> Hmm - that's an odd restriction.  I don't immediately see why such a
> restriction would be necessary.
> 
>> That said, with this tool we still
>> require some way to determine when to do parking/reactivation of
>> hyperthreads. We could certainly park hyperthreads when we see the
>> system is being oversubscribed in terms of number of vCPUs being
>> active, but for real optimization we would have to understand the
>> workloads running within the VMs if I understand correctly?
> 
> TBH, I'd perhaps start with an admin control which lets them switch
> between the two modes, and some instructions on how/why they might want
> to try switching.
> 
> Trying to second-guess the best HT setting automatically is most likely
> going to be a lost cause.  It will be system specific as to whether the
> same workload is better with or without HT.

There may be hardware-specific performance counters that could be used
to detect when pathological cases are happening.  But that would need to
be implemented and/or re-verified on basically every new piece of hardware.

 -George



Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Jan Beulich
>>> On 26.10.18 at 11:28,  wrote:
> On Fri, Oct 26, 2018 at 03:16:15AM -0600, Jan Beulich wrote:
>> >>> On 25.10.18 at 18:29,  wrote:
>> > A split xenheap model means that data pertaining to other guests isn't
>> > mapped in the context of this vcpu, so cannot be brought into the cache.
>> 
>> It was not clear to me from Wei's original mail that talk here is
>> about "split" in a sense of "per-domain"; I was assuming the
>> CONFIG_SEPARATE_XENHEAP mode instead.
> 
> The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
> what I wanted most is the partial direct map, which reduces the amount
> of data mapped inside Xen's context -- the original idea was removing the
> direct map, discussed during one of the calls IIRC. I thought making the
> partial direct map mode work and making it as small as possible would get
> us 90% there.
> 
> The "per-domain" heap is a different work item.

But if we mean to go that route, going (back) to the separate
Xen heap model seems just like an extra complication to me.
Yet I agree that this would remove the need for a fair chunk of
the direct map. Otoh a statically partitioned Xen heap would
bring back scalability issues which we had specifically meant to
get rid of by moving away from that model.

Jan




Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Wei Liu
On Fri, Oct 26, 2018 at 03:16:15AM -0600, Jan Beulich wrote:
> >>> On 25.10.18 at 18:29,  wrote:
> > A split xenheap model means that data pertaining to other guests isn't
> > mapped in the context of this vcpu, so cannot be brought into the cache.
> 
> It was not clear to me from Wei's original mail that talk here is
> about "split" in a sense of "per-domain"; I was assuming the
> CONFIG_SEPARATE_XENHEAP mode instead.

The split heap was indeed referring to CONFIG_SEPARATE_XENHEAP mode, yet
what I wanted most is the partial direct map, which reduces the amount
of data mapped inside Xen's context -- the original idea was removing the
direct map, discussed during one of the calls IIRC. I thought making the
partial direct map mode work and making it as small as possible would get
us 90% there.

The "per-domain" heap is a different work item.

Wei.

> 
> Jan
> 
> 


Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-26 Thread Jan Beulich
>>> On 25.10.18 at 18:29,  wrote:
> A split xenheap model means that data pertaining to other guests isn't
> mapped in the context of this vcpu, so cannot be brought into the cache.

It was not clear to me from Wei's original mail that talk here is
about "split" in a sense of "per-domain"; I was assuming the
CONFIG_SEPARATE_XENHEAP mode instead.

Jan




Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Andrew Cooper
On 25/10/18 19:35, Tamas K Lengyel wrote:
> On Thu, Oct 25, 2018 at 12:13 PM Andrew Cooper
>  wrote:
>> On 25/10/18 18:58, Tamas K Lengyel wrote:
>>> On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
>>>  wrote:
 On 25/10/18 18:35, Tamas K Lengyel wrote:
> On Thu, Oct 25, 2018 at 11:02 AM George Dunlap  
> wrote:
>> On 10/25/2018 05:55 PM, Andrew Cooper wrote:
>>> On 24/10/18 16:24, Tamas K Lengyel wrote:
> A solution to this issue was proposed, whereby Xen synchronises siblings
> on vmexit/entry, so we are never executing code in two different
> privilege levels.  Getting this working would make it safe to continue
> using hyperthreading even in the presence of L1TF.  Obviously, it's going
> to come with a perf hit, but compared to disabling hyperthreading, all it's
> got to do is beat a 60% perf hit to make it the preferable option for
> making your system L1TF-proof.
 Could you shed some light on what tests were done where that 60%
 performance hit was observed? We have performed intensive stress-tests
 to confirm this but according to our findings turning off
 hyper-threading is actually improving performance on all machines we
 tested thus far.
>>> Aggregate inter and intra host disk and network throughput, which is a
>>> reasonable approximation of a load of webserver VM's on a single
>>> physical server.  Small packet IO was hit worst, as it has a very high
>>> vcpu context switch rate between dom0 and domU.  Disabling HT means you
>>> have half the number of logical cores to schedule on, which doubles the
>>> mean time to next timeslice.
>>>
>>> In principle, for a fully optimised workload, HT gets you ~30% extra due
>>> to increased utilisation of the pipeline functional units.  Some
>>> resources are statically partitioned, while some are competitively
>>> shared, and it's now been well proven that actions on one thread can have
>>> a large effect on others.
>>>
>>> Two arbitrary vcpus are not an optimised workload.  If the perf
>>> improvement you get from not competing in the pipeline is greater than
>>> the perf loss from Xen's reduced capability to schedule, then disabling
>>> HT would be an improvement.  I can certainly believe that this might be
>>> the case for Qubes style workloads where you are probably not very
>>> overprovisioned, and you probably don't have long running IO and CPU
>>> bound tasks in the VMs.
>> As another data point, I think it was MSCI who said they always disabled
>> hyperthreading, because they also found that their workloads ran slower
>> with HT than without.  Presumably they were doing massive number
>> crunching, such that each thread was waiting on the ALU a significant
>> portion of the time anyway; at which point the superscalar scheduling
>> and/or reduction in cache efficiency would have brought performance from
>> "no benefit" down to "negative benefit".
>>
> Thanks for the insights. Indeed, we are primarily concerned with
> performance of Qubes-style workloads which may range from
> no-oversubscription to heavily oversubscribed. It's not a workload we
> can predict or optimize before-hand, so we are looking for a default
> that would be 1) safe and 2) performant in the most general case
> possible.
 So long as you've got the XSA-273 patches, you should be able to park
 and re-reactivate hyperthreads using `xen-hptool cpu-{online,offline} 
 $CPU`.

 You should be able to effectively change hyperthreading configuration at
 runtime.  It's not quite the same as changing it in the BIOS, but from a
 competition of pipeline resources, it should be good enough.

>>> Thanks, indeed that is a handy tool to have. We often can't disable
>>> hyperthreading in the BIOS anyway because most BIOS' don't allow you
>>> to do that when TXT is used.
>> Hmm - that's an odd restriction.  I don't immediately see why such a
>> restriction would be necessary.
>>
>>> That said, with this tool we still
>>> require some way to determine when to do parking/reactivation of
>>> hyperthreads. We could certainly park hyperthreads when we see the
>>> system is being oversubscribed in terms of number of vCPUs being
>>> active, but for real optimization we would have to understand the
>>> workloads running within the VMs if I understand correctly?
>> TBH, I'd perhaps start with an admin control which lets them switch
>> between the two modes, and some instructions on how/why they might want
>> to try switching.
>>
>> Trying to second-guess the best HT setting automatically is most likely
>> going to be a lost cause.  It will be system specific as to whether the
>> same workload is better with or without HT.
> This may just not be practically possible at the end as the system
> administrator may have no idea what workload will be running on any
> given system.

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Tamas K Lengyel
On Thu, Oct 25, 2018 at 12:13 PM Andrew Cooper
 wrote:
>
> On 25/10/18 18:58, Tamas K Lengyel wrote:
> > On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
> >  wrote:
> >> On 25/10/18 18:35, Tamas K Lengyel wrote:
> >>> On Thu, Oct 25, 2018 at 11:02 AM George Dunlap  
> >>> wrote:
>  On 10/25/2018 05:55 PM, Andrew Cooper wrote:
> > On 24/10/18 16:24, Tamas K Lengyel wrote:
> >>> A solution to this issue was proposed, whereby Xen synchronises siblings
> >>> on vmexit/entry, so we are never executing code in two different
> >>> privilege levels.  Getting this working would make it safe to continue
> >>> using hyperthreading even in the presence of L1TF.  Obviously, it's going
> >>> to come with a perf hit, but compared to disabling hyperthreading, all it's
> >>> got to do is beat a 60% perf hit to make it the preferable option for
> >>> making your system L1TF-proof.
> >> Could you shed some light on what tests were done where that 60%
> >> performance hit was observed? We have performed intensive stress-tests
> >> to confirm this but according to our findings turning off
> >> hyper-threading is actually improving performance on all machines we
> >> tested thus far.
> > Aggregate inter and intra host disk and network throughput, which is a
> > reasonable approximation of a load of webserver VM's on a single
> > physical server.  Small packet IO was hit worst, as it has a very high
> > vcpu context switch rate between dom0 and domU.  Disabling HT means you
> > have half the number of logical cores to schedule on, which doubles the
> > mean time to next timeslice.
> >
> > In principle, for a fully optimised workload, HT gets you ~30% extra due
> > to increased utilisation of the pipeline functional units.  Some
> > resources are statically partitioned, while some are competitively
> > shared, and it's now been well proven that actions on one thread can have
> > a large effect on others.
> >
> > Two arbitrary vcpus are not an optimised workload.  If the perf
> > improvement you get from not competing in the pipeline is greater than
> > the perf loss from Xen's reduced capability to schedule, then disabling
> > HT would be an improvement.  I can certainly believe that this might be
> > the case for Qubes style workloads where you are probably not very
> > overprovisioned, and you probably don't have long running IO and CPU
> > bound tasks in the VMs.
>  As another data point, I think it was MSCI who said they always disabled
>  hyperthreading, because they also found that their workloads ran slower
>  with HT than without.  Presumably they were doing massive number
>  crunching, such that each thread was waiting on the ALU a significant
>  portion of the time anyway; at which point the superscalar scheduling
>  and/or reduction in cache efficiency would have brought performance from
>  "no benefit" down to "negative benefit".
> 
> >>> Thanks for the insights. Indeed, we are primarily concerned with
> >>> performance of Qubes-style workloads which may range from
> >>> no-oversubscription to heavily oversubscribed. It's not a workload we
> >>> can predict or optimize before-hand, so we are looking for a default
> >>> that would be 1) safe and 2) performant in the most general case
> >>> possible.
> >> So long as you've got the XSA-273 patches, you should be able to park
> >> and re-reactivate hyperthreads using `xen-hptool cpu-{online,offline} 
> >> $CPU`.
> >>
> >> You should be able to effectively change hyperthreading configuration at
> >> runtime.  It's not quite the same as changing it in the BIOS, but from a
> >> competition of pipeline resources, it should be good enough.
> >>
> > Thanks, indeed that is a handy tool to have. We often can't disable
> > hyperthreading in the BIOS anyway because most BIOS' don't allow you
> > to do that when TXT is used.
>
> Hmm - that's an odd restriction.  I don't immediately see why such a
> restriction would be necessary.
>
> > That said, with this tool we still
> > require some way to determine when to do parking/reactivation of
> > hyperthreads. We could certainly park hyperthreads when we see the
> > system is being oversubscribed in terms of number of vCPUs being
> > active, but for real optimization we would have to understand the
> > workloads running within the VMs if I understand correctly?
>
> TBH, I'd perhaps start with an admin control which lets them switch
> between the two modes, and some instructions on how/why they might want
> to try switching.
>
> Trying to second-guess the best HT setting automatically is most likely
> going to be a lost cause.  It will be system specific as to whether the
> same workload is better with or without HT.

This may just not be practically possible at the end as the system
administrator may have no idea what workload will be running on any
given system.

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Andrew Cooper
On 25/10/18 18:58, Tamas K Lengyel wrote:
> On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
>  wrote:
>> On 25/10/18 18:35, Tamas K Lengyel wrote:
>>> On Thu, Oct 25, 2018 at 11:02 AM George Dunlap  
>>> wrote:
 On 10/25/2018 05:55 PM, Andrew Cooper wrote:
> On 24/10/18 16:24, Tamas K Lengyel wrote:
>>> A solution to this issue was proposed, whereby Xen synchronises siblings
>>> on vmexit/entry, so we are never executing code in two different
>>> privilege levels.  Getting this working would make it safe to continue
>>> using hyperthreading even in the presence of L1TF.  Obviously, it's going
>>> to come with a perf hit, but compared to disabling hyperthreading, all it's
>>> got to do is beat a 60% perf hit to make it the preferable option for
>>> making your system L1TF-proof.
>> Could you shed some light on what tests were done where that 60%
>> performance hit was observed? We have performed intensive stress-tests
>> to confirm this but according to our findings turning off
>> hyper-threading is actually improving performance on all machines we
>> tested thus far.
> Aggregate inter and intra host disk and network throughput, which is a
> reasonable approximation of a load of webserver VM's on a single
> physical server.  Small packet IO was hit worst, as it has a very high
> vcpu context switch rate between dom0 and domU.  Disabling HT means you
> have half the number of logical cores to schedule on, which doubles the
> mean time to next timeslice.
>
> In principle, for a fully optimised workload, HT gets you ~30% extra due
> to increased utilisation of the pipeline functional units.  Some
> resources are statically partitioned, while some are competitively
> shared, and its now been well proven that actions on one thread can have
> a large effect on others.
>
> Two arbitrary vcpus are not an optimised workload.  If the perf
> improvement you get from not competing in the pipeline is greater than
> the perf loss from Xen's reduced capability to schedule, then disabling
> HT would be an improvement.  I can certainly believe that this might be
> the case for Qubes style workloads where you are probably not very
> overprovisioned, and you probably don't have long running IO and CPU
> bound tasks in the VMs.
 As another data point, I think it was MSCI who said they always disabled
 hyperthreading, because they also found that their workloads ran slower
 with HT than without.  Presumably they were doing massive number
 crunching, such that each thread was waiting on the ALU a significant
 portion of the time anyway; at which point the superscalar scheduling
 and/or reduction in cache efficiency would have brought performance from
 "no benefit" down to "negative benefit".

>>> Thanks for the insights. Indeed, we are primarily concerned with
>>> performance of Qubes-style workloads which may range from
>>> no-oversubscription to heavily oversubscribed. It's not a workload we
>>> can predict or optimize before-hand, so we are looking for a default
>>> that would be 1) safe and 2) performant in the most general case
>>> possible.
>> So long as you've got the XSA-273 patches, you should be able to park
>> and re-reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.
>>
>> You should be able to effectively change hyperthreading configuration at
>> runtime.  It's not quite the same as changing it in the BIOS, but from a
>> competition of pipeline resources, it should be good enough.
>>
> Thanks, indeed that is a handy tool to have. We often can't disable
> hyperthreading in the BIOS anyway because most BIOS' don't allow you
> to do that when TXT is used.

Hmm - that's an odd restriction.  I don't immediately see why such a
restriction would be necessary.

> That said, with this tool we still
> require some way to determine when to do parking/reactivation of
> hyperthreads. We could certainly park hyperthreads when we see the
> system is being oversubscribed in terms of number of vCPUs being
> active, but for real optimization we would have to understand the
> workloads running within the VMs if I understand correctly?

TBH, I'd perhaps start with an admin control which lets them switch
between the two modes, and some instructions on how/why they might want
to try switching.

Trying to second-guess the best HT setting automatically is most likely
going to be a lost cause.  It will be system specific as to whether the
same workload is better with or without HT.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Tamas K Lengyel
On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
 wrote:
>
> On 25/10/18 18:35, Tamas K Lengyel wrote:
> > On Thu, Oct 25, 2018 at 11:02 AM George Dunlap  
> > wrote:
> >> On 10/25/2018 05:55 PM, Andrew Cooper wrote:
> >>> On 24/10/18 16:24, Tamas K Lengyel wrote:
> > A solution to this issue was proposed, whereby Xen synchronises siblings
> > on vmexit/entry, so we are never executing code in two different
> > privilege levels.  Getting this working would make it safe to continue
> > using hyperthreading even in the presence of L1TF.  Obviously, its going
> > to come in perf hit, but compared to disabling hyperthreading, all its
> > got to do is beat a 60% perf hit to make it the preferable option for
> > making your system L1TF-proof.
>  Could you shed some light what tests were done where that 60%
>  performance hit was observed? We have performed intensive stress-tests
>  to confirm this but according to our findings turning off
>  hyper-threading is actually improving performance on all machines we
>  tested thus far.
> >>> Aggregate inter and intra host disk and network throughput, which is a
> >>> reasonable approximation of a load of webserver VM's on a single
> >>> physical server.  Small packet IO was hit worst, as it has a very high
> >>> vcpu context switch rate between dom0 and domU.  Disabling HT means you
> >>> have half the number of logical cores to schedule on, which doubles the
> >>> mean time to next timeslice.
> >>>
> >>> In principle, for a fully optimised workload, HT gets you ~30% extra due
> >>> to increased utilisation of the pipeline functional units.  Some
> >>> resources are statically partitioned, while some are competitively
> >>> shared, and its now been well proven that actions on one thread can have
> >>> a large effect on others.
> >>>
> >>> Two arbitrary vcpus are not an optimised workload.  If the perf
> >>> improvement you get from not competing in the pipeline is greater than
> >>> the perf loss from Xen's reduced capability to schedule, then disabling
> >>> HT would be an improvement.  I can certainly believe that this might be
> >>> the case for Qubes style workloads where you are probably not very
> >>> overprovisioned, and you probably don't have long running IO and CPU
> >>> bound tasks in the VMs.
> >> As another data point, I think it was MSCI who said they always disabled
> >> hyperthreading, because they also found that their workloads ran slower
> >> with HT than without.  Presumably they were doing massive number
> >> crunching, such that each thread was waiting on the ALU a significant
> >> portion of the time anyway; at which point the superscalar scheduling
> >> and/or reduction in cache efficiency would have brought performance from
> >> "no benefit" down to "negative benefit".
> >>
> > Thanks for the insights. Indeed, we are primarily concerned with
> > performance of Qubes-style workloads which may range from
> > no-oversubscription to heavily oversubscribed. It's not a workload we
> > can predict or optimize before-hand, so we are looking for a default
> > that would be 1) safe and 2) performant in the most general case
> > possible.
>
> So long as you've got the XSA-273 patches, you should be able to park
> and re-reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.
>
> You should be able to effectively change hyperthreading configuration at
> runtime.  It's not quite the same as changing it in the BIOS, but from a
> competition of pipeline resources, it should be good enough.
>

Thanks, indeed that is a handy tool to have. We often can't disable
hyperthreading in the BIOS anyway because most BIOSes don't allow you
to do that when TXT is used. That said, with this tool we still
require some way to determine when to do parking/reactivation of
hyperthreads. We could certainly park hyperthreads when we see the
system is being oversubscribed in terms of number of vCPUs being
active, but for real optimization we would have to understand the
workloads running within the VMs if I understand correctly?

Tamas

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Andrew Cooper
On 25/10/18 18:35, Tamas K Lengyel wrote:
> On Thu, Oct 25, 2018 at 11:02 AM George Dunlap  
> wrote:
>> On 10/25/2018 05:55 PM, Andrew Cooper wrote:
>>> On 24/10/18 16:24, Tamas K Lengyel wrote:
> A solution to this issue was proposed, whereby Xen synchronises siblings
> on vmexit/entry, so we are never executing code in two different
> privilege levels.  Getting this working would make it safe to continue
> using hyperthreading even in the presence of L1TF.  Obviously, its going
> to come in perf hit, but compared to disabling hyperthreading, all its
> got to do is beat a 60% perf hit to make it the preferable option for
> making your system L1TF-proof.
 Could you shed some light what tests were done where that 60%
 performance hit was observed? We have performed intensive stress-tests
 to confirm this but according to our findings turning off
 hyper-threading is actually improving performance on all machines we
 tested thus far.
>>> Aggregate inter and intra host disk and network throughput, which is a
>>> reasonable approximation of a load of webserver VM's on a single
>>> physical server.  Small packet IO was hit worst, as it has a very high
>>> vcpu context switch rate between dom0 and domU.  Disabling HT means you
>>> have half the number of logical cores to schedule on, which doubles the
>>> mean time to next timeslice.
>>>
>>> In principle, for a fully optimised workload, HT gets you ~30% extra due
>>> to increased utilisation of the pipeline functional units.  Some
>>> resources are statically partitioned, while some are competitively
>>> shared, and its now been well proven that actions on one thread can have
>>> a large effect on others.
>>>
>>> Two arbitrary vcpus are not an optimised workload.  If the perf
>>> improvement you get from not competing in the pipeline is greater than
>>> the perf loss from Xen's reduced capability to schedule, then disabling
>>> HT would be an improvement.  I can certainly believe that this might be
>>> the case for Qubes style workloads where you are probably not very
>>> overprovisioned, and you probably don't have long running IO and CPU
>>> bound tasks in the VMs.
>> As another data point, I think it was MSCI who said they always disabled
>> hyperthreading, because they also found that their workloads ran slower
>> with HT than without.  Presumably they were doing massive number
>> crunching, such that each thread was waiting on the ALU a significant
>> portion of the time anyway; at which point the superscalar scheduling
>> and/or reduction in cache efficiency would have brought performance from
>> "no benefit" down to "negative benefit".
>>
> Thanks for the insights. Indeed, we are primarily concerned with
> performance of Qubes-style workloads which may range from
> no-oversubscription to heavily oversubscribed. It's not a workload we
> can predict or optimize before-hand, so we are looking for a default
> that would be 1) safe and 2) performant in the most general case
> possible.

So long as you've got the XSA-273 patches, you should be able to park
and re-reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.

You should be able to effectively change hyperthreading configuration at
runtime.  It's not quite the same as changing it in the BIOS, but from
the point of view of competition for pipeline resources, it should be
good enough.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Tamas K Lengyel
On Thu, Oct 25, 2018 at 11:02 AM George Dunlap  wrote:
>
> On 10/25/2018 05:55 PM, Andrew Cooper wrote:
> > On 24/10/18 16:24, Tamas K Lengyel wrote:
> >>> A solution to this issue was proposed, whereby Xen synchronises siblings
> >>> on vmexit/entry, so we are never executing code in two different
> >>> privilege levels.  Getting this working would make it safe to continue
> >>> using hyperthreading even in the presence of L1TF.  Obviously, its going
> >>> to come in perf hit, but compared to disabling hyperthreading, all its
> >>> got to do is beat a 60% perf hit to make it the preferable option for
> >>> making your system L1TF-proof.
> >> Could you shed some light what tests were done where that 60%
> >> performance hit was observed? We have performed intensive stress-tests
> >> to confirm this but according to our findings turning off
> >> hyper-threading is actually improving performance on all machines we
> >> tested thus far.
> >
> > Aggregate inter and intra host disk and network throughput, which is a
> > reasonable approximation of a load of webserver VM's on a single
> > physical server.  Small packet IO was hit worst, as it has a very high
> > vcpu context switch rate between dom0 and domU.  Disabling HT means you
> > have half the number of logical cores to schedule on, which doubles the
> > mean time to next timeslice.
> >
> > In principle, for a fully optimised workload, HT gets you ~30% extra due
> > to increased utilisation of the pipeline functional units.  Some
> > resources are statically partitioned, while some are competitively
> > shared, and its now been well proven that actions on one thread can have
> > a large effect on others.
> >
> > Two arbitrary vcpus are not an optimised workload.  If the perf
> > improvement you get from not competing in the pipeline is greater than
> > the perf loss from Xen's reduced capability to schedule, then disabling
> > HT would be an improvement.  I can certainly believe that this might be
> > the case for Qubes style workloads where you are probably not very
> > overprovisioned, and you probably don't have long running IO and CPU
> > bound tasks in the VMs.
>
> As another data point, I think it was MSCI who said they always disabled
> hyperthreading, because they also found that their workloads ran slower
> with HT than without.  Presumably they were doing massive number
> crunching, such that each thread was waiting on the ALU a significant
> portion of the time anyway; at which point the superscalar scheduling
> and/or reduction in cache efficiency would have brought performance from
> "no benefit" down to "negative benefit".
>

Thanks for the insights. Indeed, we are primarily concerned with
performance of Qubes-style workloads which may range from
no-oversubscription to heavily oversubscribed. It's not a workload we
can predict or optimize before-hand, so we are looking for a default
that would be 1) safe and 2) performant in the most general case
possible.

Tamas

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Dario Faggioli
On Thu, 2018-10-25 at 10:25 -0600, Tamas K Lengyel wrote:
> On Thu, Oct 25, 2018 at 10:01 AM Dario Faggioli 
> wrote:
> > 
> > Which is indeed very interesting. But, as we're discussing in the
> > other
> > thread, I would, in your case, do some more measurements, varying
> > the
> > configuration of the system, in order to be absolutely sure you are
> > not
> > hitting some bug or anomaly.
> 
> Sure, I would be happy to repeat tests that were done in the past to
> see whether they are still holding. We have run this test with Xen
> 4.10, 4.11 and 4.12-unstable on laptops and desktops, using credit1
> and credit2, and it is consistent that hyperthreading yields the
> worst
> performance. 
>
So, just to be clear, I'm not saying it's impossible to find a workload
for which HT is detrimental. Quite the opposite. And these benchmarks
you're running might well fall into that category.

I'm just suggesting to double check that. :-)

> It varies between platforms but it's around 10-40%
> performance hit with hyperthread on. This test we do is a very CPU
> intensive test where we heavily oversubscribe the system. But I don't
> think it would be all that unusual to run into such a setup in the
> real world from time-to-time.
> 
Ah, ok, so you're _heavily_ oversubscribing...

So, I don't think that a heavily oversubscribed host, where all vCPUs
would want to run 100% CPU intensive activities --and this not being
some transient situation-- is that common. And for the ones for which
it is, there is not much we can do, hyperthreading or not.

In any case, hyperthreading works best when the workload is mixed,
where it helps to make sure that IO-bound tasks have enough chances to
issue a lot of IO requests, without conflicting too much with the CPU-
bound tasks doing their number/logic crunching.

Having _everyone_ wanting to do actual stuff on the CPUs is, IMO, one
of the worst workloads for hyperthreading, and it is in fact a workload
where I've always seen it having the least beneficial effect on
performance. I guess it's possible that, in your case, it's actually
really doing more harm than good.

It's an interesting data point, but I wouldn't use a workload like that
to measure the benefit, or the impact, of an SMT related change.

Regards,
Dario
-- 
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread George Dunlap
On 10/25/2018 05:50 PM, Andrew Cooper wrote:
> On 25/10/18 17:43, George Dunlap wrote:
>> On 10/25/2018 05:29 PM, Andrew Cooper wrote:
>>> On 25/10/18 16:02, Jan Beulich wrote:
>>> On 25.10.18 at 16:56,  wrote:
> On 10/25/2018 03:50 PM, Jan Beulich wrote:
> On 22.10.18 at 16:55,  wrote:
>>> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
 An easy first step here is to remove Xen's directmap, which will mean
 that guests general RAM isn't mapped by default into Xen's address
 space.  This will come with some performance hit, as the
 map_domain_page() infrastructure will now have to actually
 create/destroy mappings, but removing the directmap will cause an
 improvement for non-speculative security as well (No possibility of
 ret2dir as an exploit technique).
>>> I have looked into making the "separate xenheap domheap with partial
>>> direct map" mode (see common/page_alloc.c) work but found it not as
>>> straight forward as it should've been.
>>>
>>> Before I spend more time on this, I would like some opinions on if there
>>> is other approach which might be more useful than that mode.
>> How would such a split heap model help with L1TF, where the
>> guest specifies host physical addresses in its vulnerable page
>> table entries
> I don't think it would.
>
>> (and hence could spy at xenheap but - due to not
>> being mapped - not domheap)?
> Er, didn't follow this bit -- if L1TF is related to host physical
> addresses, how does having a virtual mapping in Xen affect things in any
> way?
 Hmm, indeed. Scratch that part.
>>> There seems to be quite a bit of confusion in these replies.
>>>
>>> To exploit L1TF, the data in question has to be present in the L1 cache
>>> when the attack is performed.
>>>
>>> In practice, an attacker has to arrange for target data to be resident
>>> in the L1 cache.  One way it can do this when HT is enabled is via a
>>> cache-load gadget such as the first half of an SP1 attack on the other
>>> hyperthread.  A different way mechanism is to try and cause Xen to
>>> speculatively access a piece of data, and have the hardware prefetch
>>> bring it into the cache.
>> Right -- so a split xen/domheap model doesn't prevent L1TF attacks, but
>> it does make L1TF much harder to pull off, because it now only works if
>> you can manage to get onto the same core as the victim, after the victim
>> has accessed the data you want.
>>
>> So it would reduce the risk of L1TF significantly, but not enough (I
>> think) that we could recommend disabling other mitigations.
> 
> Correct.  All of these suggestions are for increased defence in depth. 
> They are not replacements for the existing mitigations.

But it could be a mitigation for, say, Meltdown, yes?  I'm trying to
remember the details; but wouldn't a "secret-free Xen" mean that
disabling XPTI entirely for 64-bit PV guests would be a reasonable
decision (even if many people left it enabled 'just in case')?

 -George

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Andrew Cooper
On 24/10/18 16:24, Tamas K Lengyel wrote:
>> A solution to this issue was proposed, whereby Xen synchronises siblings
>> on vmexit/entry, so we are never executing code in two different
>> privilege levels.  Getting this working would make it safe to continue
>> using hyperthreading even in the presence of L1TF.  Obviously, its going
>> to come in perf hit, but compared to disabling hyperthreading, all its
>> got to do is beat a 60% perf hit to make it the preferable option for
>> making your system L1TF-proof.
> Could you shed some light what tests were done where that 60%
> performance hit was observed? We have performed intensive stress-tests
> to confirm this but according to our findings turning off
> hyper-threading is actually improving performance on all machines we
> tested thus far.

Aggregate inter and intra host disk and network throughput, which is a
reasonable approximation of a load of webserver VM's on a single
physical server.  Small packet IO was hit worst, as it has a very high
vcpu context switch rate between dom0 and domU.  Disabling HT means you
have half the number of logical cores to schedule on, which doubles the
mean time to next timeslice.

In principle, for a fully optimised workload, HT gets you ~30% extra due
to increased utilisation of the pipeline functional units.  Some
resources are statically partitioned, while some are competitively
shared, and it's now been well proven that actions on one thread can have
a large effect on others.

Two arbitrary vcpus are not an optimised workload.  If the perf
improvement you get from not competing in the pipeline is greater than
the perf loss from Xen's reduced capability to schedule, then disabling
HT would be an improvement.  I can certainly believe that this might be
the case for Qubes style workloads where you are probably not very
overprovisioned, and you probably don't have long running IO and CPU
bound tasks in the VMs.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Andrew Cooper
On 25/10/18 17:43, George Dunlap wrote:
> On 10/25/2018 05:29 PM, Andrew Cooper wrote:
>> On 25/10/18 16:02, Jan Beulich wrote:
>> On 25.10.18 at 16:56,  wrote:
 On 10/25/2018 03:50 PM, Jan Beulich wrote:
 On 22.10.18 at 16:55,  wrote:
>> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
>>> An easy first step here is to remove Xen's directmap, which will mean
>>> that guests general RAM isn't mapped by default into Xen's address
>>> space.  This will come with some performance hit, as the
>>> map_domain_page() infrastructure will now have to actually
>>> create/destroy mappings, but removing the directmap will cause an
>>> improvement for non-speculative security as well (No possibility of
>>> ret2dir as an exploit technique).
>> I have looked into making the "separate xenheap domheap with partial
>> direct map" mode (see common/page_alloc.c) work but found it not as
>> straight forward as it should've been.
>>
>> Before I spend more time on this, I would like some opinions on if there
>> is other approach which might be more useful than that mode.
> How would such a split heap model help with L1TF, where the
> guest specifies host physical addresses in its vulnerable page
> table entries
 I don't think it would.

> (and hence could spy at xenheap but - due to not
> being mapped - not domheap)?
 Er, didn't follow this bit -- if L1TF is related to host physical
 addresses, how does having a virtual mapping in Xen affect things in any
 way?
>>> Hmm, indeed. Scratch that part.
>> There seems to be quite a bit of confusion in these replies.
>>
>> To exploit L1TF, the data in question has to be present in the L1 cache
>> when the attack is performed.
>>
>> In practice, an attacker has to arrange for target data to be resident
>> in the L1 cache.  One way it can do this when HT is enabled is via a
>> cache-load gadget such as the first half of an SP1 attack on the other
>> hyperthread.  A different way mechanism is to try and cause Xen to
>> speculatively access a piece of data, and have the hardware prefetch
>> bring it into the cache.
> Right -- so a split xen/domheap model doesn't prevent L1TF attacks, but
> it does make L1TF much harder to pull off, because it now only works if
> you can manage to get onto the same core as the victim, after the victim
> has accessed the data you want.
>
> So it would reduce the risk of L1TF significantly, but not enough (I
> think) that we could recommend disabling other mitigations.

Correct.  All of these suggestions are for increased defence in depth. 
They are not replacements for the existing mitigations.

From a practical point of view, until people work out how to
comprehensively solve SP1, reducing the quantity of mapped data is the
only practical defence that an OS/Hypervisor has.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread George Dunlap
On 10/25/2018 05:29 PM, Andrew Cooper wrote:
> On 25/10/18 16:02, Jan Beulich wrote:
> On 25.10.18 at 16:56,  wrote:
>>> On 10/25/2018 03:50 PM, Jan Beulich wrote:
>>> On 22.10.18 at 16:55,  wrote:
> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
>> An easy first step here is to remove Xen's directmap, which will mean
>> that guests general RAM isn't mapped by default into Xen's address
>> space.  This will come with some performance hit, as the
>> map_domain_page() infrastructure will now have to actually
>> create/destroy mappings, but removing the directmap will cause an
>> improvement for non-speculative security as well (No possibility of
>> ret2dir as an exploit technique).
> I have looked into making the "separate xenheap domheap with partial
> direct map" mode (see common/page_alloc.c) work but found it not as
> straight forward as it should've been.
>
> Before I spend more time on this, I would like some opinions on if there
> is other approach which might be more useful than that mode.
 How would such a split heap model help with L1TF, where the
 guest specifies host physical addresses in its vulnerable page
 table entries
>>> I don't think it would.
>>>
 (and hence could spy at xenheap but - due to not
 being mapped - not domheap)?
>>> Er, didn't follow this bit -- if L1TF is related to host physical
>>> addresses, how does having a virtual mapping in Xen affect things in any
>>> way?
>> Hmm, indeed. Scratch that part.
> 
> There seems to be quite a bit of confusion in these replies.
> 
> To exploit L1TF, the data in question has to be present in the L1 cache
> when the attack is performed.
> 
> In practice, an attacker has to arrange for target data to be resident
> in the L1 cache.  One way it can do this when HT is enabled is via a
> cache-load gadget such as the first half of an SP1 attack on the other
> hyperthread.  A different way mechanism is to try and cause Xen to
> speculatively access a piece of data, and have the hardware prefetch
> bring it into the cache.

Right -- so a split xen/domheap model doesn't prevent L1TF attacks, but
it does make L1TF much harder to pull off, because it now only works if
you can manage to get onto the same core as the victim, after the victim
has accessed the data you want.

So it would reduce the risk of L1TF significantly, but not enough (I
think) that we could recommend disabling other mitigations.

 -George

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Andrew Cooper
On 25/10/18 16:02, Jan Beulich wrote:
 On 25.10.18 at 16:56,  wrote:
>> On 10/25/2018 03:50 PM, Jan Beulich wrote:
>> On 22.10.18 at 16:55,  wrote:
 On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
> An easy first step here is to remove Xen's directmap, which will mean
> that guests general RAM isn't mapped by default into Xen's address
> space.  This will come with some performance hit, as the
> map_domain_page() infrastructure will now have to actually
> create/destroy mappings, but removing the directmap will cause an
> improvement for non-speculative security as well (No possibility of
> ret2dir as an exploit technique).
 I have looked into making the "separate xenheap domheap with partial
 direct map" mode (see common/page_alloc.c) work but found it not as
 straight forward as it should've been.

 Before I spend more time on this, I would like some opinions on if there
 is other approach which might be more useful than that mode.
>>> How would such a split heap model help with L1TF, where the
>>> guest specifies host physical addresses in its vulnerable page
>>> table entries
>> I don't think it would.
>>
>>> (and hence could spy at xenheap but - due to not
>>> being mapped - not domheap)?
>> Er, didn't follow this bit -- if L1TF is related to host physical
>> addresses, how does having a virtual mapping in Xen affect things in any
>> way?
> Hmm, indeed. Scratch that part.

There seems to be quite a bit of confusion in these replies.

To exploit L1TF, the data in question has to be present in the L1 cache
when the attack is performed.

In practice, an attacker has to arrange for target data to be resident
in the L1 cache.  One way it can do this when HT is enabled is via a
cache-load gadget such as the first half of an SP1 attack on the other
hyperthread.  A different mechanism is to try to cause Xen to
speculatively access a piece of data, and have the hardware prefetch
bring it into the cache.

Everything which is virtually mapped in Xen is potentially vulnerable,
and the goal of the "secret-free Xen" is to make sure that in the
context of one vcpu pulling off an attack like this, there is no
interesting data which can be exfiltrated.

A single xenheap model means that everything allocated with
alloc_xenheap_page() (e.g. struct domain, struct vcpu, pcpu stacks) is
potentially exposed to all domains.

A split xenheap model means that data pertaining to other guests isn't
mapped in the context of this vcpu, so cannot be brought into the cache.
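
To make the contrast concrete, a minimal sketch of the two patterns
(not actual Xen code: "struct widget", the wrapper function, and the
omitted error handling/freeing are placeholders; interface names follow
the current xen/mm.h and xen/domain_page.h):

#include <xen/domain_page.h>
#include <xen/mm.h>

struct widget { unsigned long secret; };    /* placeholder type */

static void widget_demo(struct domain *d)
{
    /* Xenheap pattern: the pointer stays valid and mapped everywhere,
     * so the data is reachable (and cache-loadable) from any guest
     * context that manages to steer speculation at it. */
    struct widget *w = alloc_xenheap_page();

    w->secret = 42;

    /* Domheap pattern: nothing is mapped by default; the data is only
     * visible while this particular context has it mapped. */
    struct page_info *pg = alloc_domheap_page(d, 0);
    struct widget *t = __map_domain_page(pg);

    t->secret = 42;
    unmap_domain_page(t);
}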

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Jan Beulich
>>> On 25.10.18 at 16:56,  wrote:
> On 10/25/2018 03:50 PM, Jan Beulich wrote:
> On 22.10.18 at 16:55,  wrote:
>>> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
 An easy first step here is to remove Xen's directmap, which will mean
 that guests general RAM isn't mapped by default into Xen's address
 space.  This will come with some performance hit, as the
 map_domain_page() infrastructure will now have to actually
 create/destroy mappings, but removing the directmap will cause an
 improvement for non-speculative security as well (No possibility of
 ret2dir as an exploit technique).
>>>
>>> I have looked into making the "separate xenheap domheap with partial
>>> direct map" mode (see common/page_alloc.c) work but found it not as
>>> straight forward as it should've been.
>>>
>>> Before I spend more time on this, I would like some opinions on if there
>>> is other approach which might be more useful than that mode.
>> 
>> How would such a split heap model help with L1TF, where the
>> guest specifies host physical addresses in its vulnerable page
>> table entries
> 
> I don't think it would.
> 
>> (and hence could spy at xenheap but - due to not
>> being mapped - not domheap)?
> 
> Er, didn't follow this bit -- if L1TF is related to host physical
> addresses, how does having a virtual mapping in Xen affect things in any
> way?

Hmm, indeed. Scratch that part.

Jan



___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread George Dunlap
On 10/25/2018 03:50 PM, Jan Beulich wrote:
 On 22.10.18 at 16:55,  wrote:
>> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
>>> An easy first step here is to remove Xen's directmap, which will mean
>>> that guests general RAM isn't mapped by default into Xen's address
>>> space.  This will come with some performance hit, as the
>>> map_domain_page() infrastructure will now have to actually
>>> create/destroy mappings, but removing the directmap will cause an
>>> improvement for non-speculative security as well (No possibility of
>>> ret2dir as an exploit technique).
>>
>> I have looked into making the "separate xenheap domheap with partial
>> direct map" mode (see common/page_alloc.c) work but found it not as
>> straight forward as it should've been.
>>
>> Before I spend more time on this, I would like some opinions on if there
>> is other approach which might be more useful than that mode.
> 
> How would such a split heap model help with L1TF, where the
> guest specifies host physical addresses in its vulnerable page
> table entries

I don't think it would.

> (and hence could spy at xenheap but - due to not
> being mapped - not domheap)?

Er, didn't follow this bit -- if L1TF is related to host physical
addresses, how does having a virtual mapping in Xen affect things in any
way?

 -George

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-25 Thread Jan Beulich
>>> On 22.10.18 at 16:55,  wrote:
> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
>> An easy first step here is to remove Xen's directmap, which will mean
>> that guests general RAM isn't mapped by default into Xen's address
>> space.  This will come with some performance hit, as the
>> map_domain_page() infrastructure will now have to actually
>> create/destroy mappings, but removing the directmap will cause an
>> improvement for non-speculative security as well (No possibility of
>> ret2dir as an exploit technique).
> 
> I have looked into making the "separate xenheap domheap with partial
> direct map" mode (see common/page_alloc.c) work but found it not as
> straight forward as it should've been.
> 
> Before I spend more time on this, I would like some opinions on if there
> is other approach which might be more useful than that mode.

How would such a split heap model help with L1TF, where the
guest specifies host physical addresses in its vulnerable page
table entries (and hence could spy at xenheap but - due to not
being mapped - not domheap)?

Jan



___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-24 Thread Tamas K Lengyel
> A solution to this issue was proposed, whereby Xen synchronises siblings
> on vmexit/entry, so we are never executing code in two different
> privilege levels.  Getting this working would make it safe to continue
> using hyperthreading even in the presence of L1TF.  Obviously, its going
> to come in perf hit, but compared to disabling hyperthreading, all its
> got to do is beat a 60% perf hit to make it the preferable option for
> making your system L1TF-proof.

Could you shed some light on what tests were done where that 60%
performance hit was observed? We have performed intensive stress-tests
to confirm this but according to our findings turning off
hyper-threading is actually improving performance on all machines we
tested thus far.

Thanks,
Tamas

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-22 Thread Andrew Cooper
On 22/10/18 16:09, Woodhouse, David wrote:
> Adding Stefan to Cc.
>
> Should we take this to the spexen or another mailing list?

Now that L1TF is public, so is all of this.  I see no reason to continue
it in private.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-22 Thread Woodhouse, David
Adding Stefan to Cc.

Should we take this to the spexen or another mailing list?


On Mon, 2018-10-22 at 15:55 +0100, Wei Liu wrote:
> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
> > Hello,
> > 
> > This is an accumulation and summary of various tasks which have been
> > discussed since the revelation of the speculative security issues in
> > January, and also an invitation to discuss alternative ideas.  They are
> > x86 specific, but a lot of the principles are architecture-agnostic.
> > 
> > 1) A secrets-free hypervisor.
> > 
> > Basically every hypercall can be (ab)used by a guest, and used as an
> > arbitrary cache-load gadget.  Logically, this is the first half of a
> > Spectre SP1 gadget, and is usually the first stepping stone to
> > exploiting one of the speculative sidechannels.
> > 
> > Short of compiling Xen with LLVM's Speculative Load Hardening (which is
> > still experimental, and comes with a ~30% perf hit in the common case),
> > this is unavoidable.  Furthermore, throwing a few array_index_nospec()
> > into the code isn't a viable solution to the problem.
> > 
> > An alternative option is to have less data mapped into Xen's virtual
> > address space - if a piece of memory isn't mapped, it can't be loaded
> > into the cache.
> > 
> > An easy first step here is to remove Xen's directmap, which will mean
> > that guests general RAM isn't mapped by default into Xen's address
> > space.  This will come with some performance hit, as the
> > map_domain_page() infrastructure will now have to actually
> > create/destroy mappings, but removing the directmap will cause an
> > improvement for non-speculative security as well (No possibility of
> > ret2dir as an exploit technique).
> 
> I have looked into making the "separate xenheap domheap with partial
> direct map" mode (see common/page_alloc.c) work but found it not as
> straight forward as it should've been.
> 
> Before I spend more time on this, I would like some opinions on if there
> is other approach which might be more useful than that mode.
> 
> > 
> > Beyond the directmap, there are plenty of other interesting secrets in
> > the Xen heap and other mappings, such as the stacks of the other pcpus. 
> > Fixing this requires moving Xen to having a non-uniform memory layout,
> > and this is much harder to change.  I already experimented with this as
> > a meltdown mitigation around about a year ago, and posted the resulting
> > series on Jan 4th,
> > https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
> > some trivial bits of which have already found their way upstream.
> > 
> > To have a non-uniform memory layout, Xen may not share L4 pagetables. 
> > i.e. Xen must never have two pcpus which reference the same pagetable in
> > %cr3.
> > 
> > This property already holds for 32bit PV guests, and all HVM guests, but
> > 64bit PV guests are the sticking point.  Because Linux has a flat memory
> > layout, when a 64bit PV guest schedules two threads from the same
> > process on separate vcpus, those two vcpus have the same virtual %cr3,
> > and currently, Xen programs the same real %cr3 into hardware.
> 
> Which bit of Linux code are you referring to? If you remember it off the
> top of your head, it would save me some time digging around. If not,
> never mind, I can look it up myself.
> 
> > 
> > If we want Xen to have a non-uniform layout, are two options are:
> > * Fix Linux to have the same non-uniform layout that Xen wants
> > (Backwards compatibility for older 64bit PV guests can be achieved with
> > xen-shim).
> > * Make use XPTI algorithm (specifically, the pagetable sync/copy part)
> > forever more in the future.
> > 
> > Option 2 isn't great (especially for perf on fixed hardware), but does
> > keep all the necessary changes in Xen.  Option 1 looks to be the better
> > option longterm.
> 
> What is the problem with 1+2 at the same time? I think XPTI can be
> enabled / disabled on a per-guest basis?
> 
> Wei.








___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-22 Thread Wei Liu
On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
> Hello,
> 
> This is an accumulation and summary of various tasks which have been
> discussed since the revelation of the speculative security issues in
> January, and also an invitation to discuss alternative ideas.  They are
> x86 specific, but a lot of the principles are architecture-agnostic.
> 
> 1) A secrets-free hypervisor.
> 
> Basically every hypercall can be (ab)used by a guest, and used as an
> arbitrary cache-load gadget.  Logically, this is the first half of a
> Spectre SP1 gadget, and is usually the first stepping stone to
> exploiting one of the speculative sidechannels.
> 
> Short of compiling Xen with LLVM's Speculative Load Hardening (which is
> still experimental, and comes with a ~30% perf hit in the common case),
> this is unavoidable.  Furthermore, throwing a few array_index_nospec()
> into the code isn't a viable solution to the problem.
> 
> An alternative option is to have less data mapped into Xen's virtual
> address space - if a piece of memory isn't mapped, it can't be loaded
> into the cache.
> 
> An easy first step here is to remove Xen's directmap, which will mean
> that guests general RAM isn't mapped by default into Xen's address
> space.  This will come with some performance hit, as the
> map_domain_page() infrastructure will now have to actually
> create/destroy mappings, but removing the directmap will cause an
> improvement for non-speculative security as well (No possibility of
> ret2dir as an exploit technique).

I have looked into making the "separate xenheap domheap with partial
direct map" mode (see common/page_alloc.c) work but found it not as
straight forward as it should've been.

Before I spend more time on this, I would like some opinions on if there
is other approach which might be more useful than that mode.

> 
> Beyond the directmap, there are plenty of other interesting secrets in
> the Xen heap and other mappings, such as the stacks of the other pcpus. 
> Fixing this requires moving Xen to having a non-uniform memory layout,
> and this is much harder to change.  I already experimented with this as
> a meltdown mitigation around about a year ago, and posted the resulting
> series on Jan 4th,
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
> some trivial bits of which have already found their way upstream.
> 
> To have a non-uniform memory layout, Xen may not share L4 pagetables. 
> i.e. Xen must never have two pcpus which reference the same pagetable in
> %cr3.
> 
> This property already holds for 32bit PV guests, and all HVM guests, but
> 64bit PV guests are the sticking point.  Because Linux has a flat memory
> layout, when a 64bit PV guest schedules two threads from the same
> process on separate vcpus, those two vcpus have the same virtual %cr3,
> and currently, Xen programs the same real %cr3 into hardware.

Which bit of Linux code are you referring to? If you remember it off the
top of your head, it would save me some time digging around. If not,
never mind, I can look it up myself.

> 
> If we want Xen to have a non-uniform layout, are two options are:
> * Fix Linux to have the same non-uniform layout that Xen wants
> (Backwards compatibility for older 64bit PV guests can be achieved with
> xen-shim).
> * Make use XPTI algorithm (specifically, the pagetable sync/copy part)
> forever more in the future.
> 
> Option 2 isn't great (especially for perf on fixed hardware), but does
> keep all the necessary changes in Xen.  Option 1 looks to be the better
> option longterm.

What is the problem with 1+2 at the same time? I think XPTI can be
enabled / disabled on a per-guest basis?

Wei.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-22 Thread Mihai Donțu
On Fri, 2018-10-19 at 13:17 +0100, Andrew Cooper wrote:
> [...]
> 
> > Therefore, although I certainly think we _must_ have the proper
> > scheduler enhancements in place (and in fact I'm working on that :-D)
> > it should IMO still be possible for the user to decide whether or not
> > to use them (either by opting-in or opting-out, I don't care much at
> > this stage).
> 
> I'm not suggesting that we leave people without a choice, but given an
> option which doesn't share siblings between different guests, it should
> be the default.

+1

> [...]
> 
> Its best to consider the secret-free Xen and scheduler improvements as
> orthogonal.  In particular, the secret-free Xen is defence in depth
> against SP1, and the risk of future issues, but does have
> non-speculative benefits as well.
> 
> That said, the only way to use HT and definitely be safe to L1TF without
> a secret-free Xen is to have the synchronised entry/exit logic working.
> 
> > > A solution to this issue was proposed, whereby Xen synchronises
> > > siblings on vmexit/entry, so we are never executing code in two different
> > > privilege levels.  Getting this working would make it safe to
> > > continue using hyperthreading even in the presence of L1TF.  
> > 
> > Err... ok, but we still want core-aware scheduling, or at least we want
> > to avoid having vcpus from different domains on siblings, don't we? In
> > order to avoid leaks between guests, I mean.
> 
> Ideally, we'd want all of these.  I expect the only reasonable way to
> develop them is one on top of another.

If there was a vote, I'd place the scheduler changes at the top.

-- 
Mihai Donțu


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-19 Thread Andrew Cooper
On 19/10/18 09:09, Dario Faggioli wrote:
> On Thu, 2018-10-18 at 18:46 +0100, Andrew Cooper wrote:
>> Hello,
>>
> Hey,
>
> This is very accurate and useful... thanks for it. :-)
>
>> 1) A secrets-free hypervisor.
>>
>> Basically every hypercall can be (ab)used by a guest, and used as an
>> arbitrary cache-load gadget.  Logically, this is the first half of a
>> Spectre SP1 gadget, and is usually the first stepping stone to
>> exploiting one of the speculative sidechannels.
>>
>> Short of compiling Xen with LLVM's Speculative Load Hardening (which
>> is
>> still experimental, and comes with a ~30% perf hit in the common
>> case),
>> this is unavoidable.  Furthermore, throwing a few
>> array_index_nospec()
>> into the code isn't a viable solution to the problem.
>>
>> An alternative option is to have less data mapped into Xen's virtual
>> address space - if a piece of memory isn't mapped, it can't be loaded
>> into the cache.
>>
>> [...]
>>
>> 2) Scheduler improvements.
>>
>> (I'm afraid this is rather more sparse because I'm less familiar with
>> the scheduler details.)
>>
>> At the moment, all of Xen's schedulers will happily put two vcpus
>> from
>> different domains on sibling hyperthreads.  There has been a lot of
>> sidechannel research over the past decade demonstrating ways for one
>> thread to infer what is going on the other, but L1TF is the first
>> vulnerability I'm aware of which allows one thread to directly read
>> data
>> out of the other.
>>
>> Either way, it is now definitely a bad thing to run different guests
>> concurrently on siblings.  
>>
> Well, yes. But, as you say, L1TF, and I'd say TLBLeed as well, are the
> first serious issues discovered so far and, for instance, even on x86,
> not all Intel CPUs and none of the AMD ones, AFAIK, are affected.

TLBleed is an excellent paper and associated research, but is still just
inference - a vast quantity of post-processing is required to extract
the key.

There are plenty of other sidechannels which affect all SMT
implementations, such as the effects of executing an mfence instruction,
execution unit port contention, and so on.

> Therefore, although I certainly think we _must_ have the proper
> scheduler enhancements in place (and in fact I'm working on that :-D)
> it should IMO still be possible for the user to decide whether or not
> to use them (either by opting-in or opting-out, I don't care much at
> this stage).

I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.

>
>> Fixing this by simply not scheduling vcpus
>> from a different guest on siblings does result in a lower resource
>> utilisation, most notably when there are an odd number runable vcpus
>> in
>> a domain, as the other thread is forced to idle.
>>
> Right.
>
>> A step beyond this is core-aware scheduling, where we schedule in
>> units
>> of a virtual core rather than a virtual thread.  This has much better
>> behaviour from the guests point of view, as the actually-scheduled
>> topology remains consistent, but does potentially come with even
>> lower
>> utilisation if every other thread in the guest is idle.
>>
> Yes, basically, what you describe as 'core-aware scheduling' here can
> be build on top of what you had described above as 'not scheduling
> vcpus from different guests'.
>
> I mean, we can/should put ourselves in a position where the user can
> choose if he/she wants:
> - just 'plain scheduling', as we have now,
> - "just" that only vcpus of the same domains are scheduled on siblings
> hyperthread,
> - full 'core-aware scheduling', i.e., only vcpus that the guest
> actually sees as virtual hyperthread siblings, are scheduled on
> hardware hyperthread siblings.
>
> About the performance impact, indeed it's even higher with core-aware
> scheduling. Something that we can see about doing, is acting on the
> guest scheduler, e.g., telling it to try to "pack the load", and keep
> siblings busy, instead of trying to avoid doing that (which is what
> happens by default in most cases).
>
> In Linux, this can be done by playing with the sched-flags (see, e.g.,
> https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20
>  ,
> and /proc/sys/kernel/sched_domain/cpu*/domain*/flags ).
>
> The idea would be to avoid, as much as possible, the case when "every
> other thread is idle in the guest". I'm not sure about being able to do
> something by default, but we can certainly document things (like "if
> you enable core-scheduling, also do `echo 1234 > /proc/sys/.../flags'
> in your Linux guests").
>
> I haven't checked whether other OSs' schedulers have something similar.
>
>> A side requirement for core-aware scheduling is for Xen to have an
>> accurate idea of the topology presented to the guest.  I need to dust
>> off my Toolstack CPUID/MSR improvement series and get that upstream.
>>
> Indeed. Without knowing which one of the guest's vcpus are to be
> considered virtual hyperthread siblings, I can only get you as far as
> "only scheduling vcpus of the same domain on siblings hyperthread". :-)

Re: [Xen-devel] Ongoing/future speculative mitigation work

2018-10-19 Thread Dario Faggioli
On Thu, 2018-10-18 at 18:46 +0100, Andrew Cooper wrote:
> Hello,
> 
Hey,

This is very accurate and useful... thanks for it. :-)

> 1) A secrets-free hypervisor.
> 
> Basically every hypercall can be (ab)used by a guest, and used as an
> arbitrary cache-load gadget.  Logically, this is the first half of a
> Spectre SP1 gadget, and is usually the first stepping stone to
> exploiting one of the speculative sidechannels.
> 
> Short of compiling Xen with LLVM's Speculative Load Hardening (which
> is
> still experimental, and comes with a ~30% perf hit in the common
> case),
> this is unavoidable.  Furthermore, throwing a few
> array_index_nospec()
> into the code isn't a viable solution to the problem.
> 
> An alternative option is to have less data mapped into Xen's virtual
> address space - if a piece of memory isn't mapped, it can't be loaded
> into the cache.
> 
> [...]
> 
> 2) Scheduler improvements.
> 
> (I'm afraid this is rather more sparse because I'm less familiar with
> the scheduler details.)
> 
> At the moment, all of Xen's schedulers will happily put two vcpus
> from
> different domains on sibling hyperthreads.  There has been a lot of
> sidechannel research over the past decade demonstrating ways for one
> thread to infer what is going on the other, but L1TF is the first
> vulnerability I'm aware of which allows one thread to directly read
> data
> out of the other.
> 
> Either way, it is now definitely a bad thing to run different guests
> concurrently on siblings.  
>
Well, yes. But, as you say, L1TF, and I'd say TLBLeed as well, are the
first serious issues discovered so far and, for instance, even on x86,
not all Intel CPUs and none of the AMD ones, AFAIK, are affected.

Therefore, although I certainly think we _must_ have the proper
scheduler enhancements in place (and in fact I'm working on that :-D)
it should IMO still be possible for the user to decide whether or not
to use them (either by opting-in or opting-out, I don't care much at
this stage).

> Fixing this by simply not scheduling vcpus
> from a different guest on siblings does result in a lower resource
> utilisation, most notably when there are an odd number runable vcpus
> in
> a domain, as the other thread is forced to idle.
> 
Right.

> A step beyond this is core-aware scheduling, where we schedule in
> units
> of a virtual core rather than a virtual thread.  This has much better
> behaviour from the guests point of view, as the actually-scheduled
> topology remains consistent, but does potentially come with even
> lower
> utilisation if every other thread in the guest is idle.
> 
Yes, basically, what you describe as 'core-aware scheduling' here can
be built on top of what you had described above as 'not scheduling
vcpus from different guests'.

I mean, we can/should put ourselves in a position where the user can
choose if he/she wants:
- just 'plain scheduling', as we have now,
- "just" that only vcpus of the same domains are scheduled on siblings
hyperthread,
- full 'core-aware scheduling', i.e., only vcpus that the guest
actually sees as virtual hyperthread siblings, are scheduled on
hardware hyperthread siblings.

About the performance impact, indeed it's even higher with core-aware
scheduling. Something that we can see about doing, is acting on the
guest scheduler, e.g., telling it to try to "pack the load", and keep
siblings busy, instead of trying to avoid doing that (which is what
happens by default in most cases).

In Linux, this can be done by playing with the sched-flags (see, e.g.,
https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20
 ,
and /proc/sys/kernel/sched_domain/cpu*/domain*/flags ).

The idea would be to avoid, as much as possible, the case when "every
other thread is idle in the guest". I'm not sure about being able to do
something by default, but we can certainly document things (like "if
you enable core-scheduling, also do `echo 1234 > /proc/sys/.../flags'
in your Linux guests").

I haven't checked whether other OSs' schedulers have something similar.

> A side requirement for core-aware scheduling is for Xen to have an
> accurate idea of the topology presented to the guest.  I need to dust
> off my Toolstack CPUID/MSR improvement series and get that upstream.
> 
Indeed. Without knowing which one of the guest's vcpus are to be
considered virtual hyperthread siblings, I can only get you as far as
"only scheduling vcpus of the same domain on siblings hyperthread". :-)

> One of the most insidious problems with L1TF is that, with
> hyperthreading enabled, a malicious guest kernel can engineer
> arbitrary
> data leakage by having one thread scanning the expected physical
> address, and the other thread using an arbitrary cache-load gadget in
> hypervisor context.  This occurs because the L1 data cache is shared
> by
> threads.
>
Right. So, sorry if this is a stupid question, but how does this relate
to the "secret-free hypervisor", and with the "if a piece of memory
isn't mapped, it can't be loaded into the cache" approach?

[Xen-devel] Ongoing/future speculative mitigation work

2018-10-18 Thread Andrew Cooper
Hello,

This is an accumulation and summary of various tasks which have been
discussed since the revelation of the speculative security issues in
January, and also an invitation to discuss alternative ideas.  They are
x86 specific, but a lot of the principles are architecture-agnostic.

1) A secrets-free hypervisor.

Basically every hypercall can be (ab)used by a guest, and used as an
arbitrary cache-load gadget.  Logically, this is the first half of a
Spectre SP1 gadget, and is usually the first stepping stone to
exploiting one of the speculative sidechannels.

Short of compiling Xen with LLVM's Speculative Load Hardening (which is
still experimental, and comes with a ~30% perf hit in the common case),
this is unavoidable.  Furthermore, throwing a few array_index_nospec()
into the code isn't a viable solution to the problem.
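
For reference, the shape of the problem, and of the per-site
array_index_nospec() style of fix, looks roughly like this.  This is a
user-space sketch only; the clamp below is hand-rolled for
illustration, and the real macro in Xen/Linux is rather more careful
and partly arch-specific:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * 'vulnerable' has the classic shape of a Spectre v1 half-gadget:
     * the bounds check is respected architecturally, but may be
     * bypassed speculatively, pulling attacker-chosen data into the
     * cache via 'probe'.
     */
    #define TABLE_SIZE 16
    static uint8_t table[TABLE_SIZE];
    static uint8_t probe[256 * 64];

    uint8_t vulnerable(size_t idx)
    {
        if ( idx < TABLE_SIZE )
            return probe[table[idx] * 64];
        return 0;
    }

    /*
     * Hand-rolled clamp in the spirit of array_index_nospec(): force
     * the index to 0 on the speculative path where idx >= size,
     * without a conditional branch the CPU could mispredict.
     */
    static inline size_t index_nospec(size_t idx, size_t size)
    {
        size_t mask = (size_t)((intptr_t)(idx - size) >>
                               (sizeof(size_t) * 8 - 1));

        return idx & mask;   /* mask is ~0 when idx < size, 0 otherwise */
    }

    uint8_t hardened(size_t idx)
    {
        if ( idx < TABLE_SIZE )
            return probe[table[index_nospec(idx, TABLE_SIZE)] * 64];
        return 0;
    }

    int main(void)
    {
        printf("%d %d\n", vulnerable(3), hardened(3));
        return 0;
    }

The point stands that auditing and annotating every such index in every
hypercall path doesn't scale.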

An alternative option is to have less data mapped into Xen's virtual
address space - if a piece of memory isn't mapped, it can't be loaded
into the cache.

An easy first step here is to remove Xen's directmap, which will mean
that guests' general RAM isn't mapped by default into Xen's address
space.  This will come with some performance hit, as the
map_domain_page() infrastructure will now have to actually
create/destroy mappings, but removing the directmap will cause an
improvement for non-speculative security as well (No possibility of
ret2dir as an exploit technique).
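
To put the perf point in context: without a directmap, every access to
guest-owned memory has to follow the map/use/unmap discipline, and the
map/unmap steps involve real pagetable updates (and eventually TLB
maintenance).  A user-space analogy only, with mmap()/munmap() standing
in for that pagetable work:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FRAME_SIZE 4096

    /*
     * Not Xen code: the file stands in for guest RAM, and mmap()/munmap()
     * stand in for the pagetable create/destroy work that
     * map_domain_page()/unmap_domain_page() must do once there is no
     * directmap to lean on.
     */
    int main(void)
    {
        int fd = open("/tmp/fake_ram", O_RDWR | O_CREAT, 0600);
        unsigned char *va;

        if ( fd < 0 || ftruncate(fd, 16 * FRAME_SIZE) )
            return 1;

        /* "map_domain_page()" for frame 3: build a mapping ... */
        va = mmap(NULL, FRAME_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                  fd, 3 * FRAME_SIZE);
        if ( va == MAP_FAILED )
            return 1;

        memset(va, 0xaa, FRAME_SIZE);          /* ... use it ... */
        printf("frame 3, first byte: %#x\n", va[0]);

        munmap(va, FRAME_SIZE);                /* ... "unmap_domain_page()" */
        close(fd);
        return 0;
    }

With the directmap present, the "map" step is essentially pointer
arithmetic; that is exactly the cheapness we'd be giving up.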

Beyond the directmap, there are plenty of other interesting secrets in
the Xen heap and other mappings, such as the stacks of the other pcpus. 
Fixing this requires moving Xen to having a non-uniform memory layout,
and this is much harder to change.  I already experimented with this as
a meltdown mitigation around about a year ago, and posted the resulting
series on Jan 4th,
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
some trivial bits of which have already found their way upstream.

To have a non-uniform memory layout, Xen may not share L4 pagetables. 
i.e. Xen must never have two pcpus which reference the same pagetable in
%cr3.

This property already holds for 32bit PV guests, and all HVM guests, but
64bit PV guests are the sticking point.  Because Linux has a flat memory
layout, when a 64bit PV guest schedules two threads from the same
process on separate vcpus, those two vcpus have the same virtual %cr3,
and currently, Xen programs the same real %cr3 into hardware.

If we want Xen to have a non-uniform layout, our two options are:
* Fix Linux to have the same non-uniform layout that Xen wants
(Backwards compatibility for older 64bit PV guests can be achieved with
xen-shim).
* Make use of the XPTI algorithm (specifically, the pagetable sync/copy
part) forevermore in the future.

Option 2 isn't great (especially for perf on fixed hardware), but does
keep all the necessary changes in Xen.  Option 1 looks to be the better
option longterm.

As an interesting point to note.  The 32bit PV ABI prohibits sharing of
L3 pagetables, because back in the 32bit hypervisor days, we used to
have linear mappings in the Xen virtual range.  This check is stale
(from a functionality point of view), but still present in Xen.  A
consequence of this is that 32bit PV guests definitely don't share
top-level pagetables across vcpus.

Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.


2) Scheduler improvements.

(I'm afraid this is rather more sparse because I'm less familiar with
the scheduler details.)

At the moment, all of Xen's schedulers will happily put two vcpus from
different domains on sibling hyperthreads.  There has been a lot of
sidechannel research over the past decade demonstrating ways for one
thread to infer what is going on on the other, but L1TF is the first
vulnerability I'm aware of which allows one thread to directly read data
out of the other.

Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.  Fixing this by simply not scheduling vcpus
from a different guest on siblings does result in a lower resource
utilisation, most notably when there are an odd number of runnable vcpus in
a domain, as the other thread is forced to idle.
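(As a concrete example: a guest with 3 runnable vcpus ends up occupying
two cores, and the fourth thread on the second core cannot be handed to
any other guest under this policy, so it sits idle -- 3 busy threads
where the hardware could run 4.)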

A step beyond this is core-aware scheduling, where we schedule in units
of a virtual core rather than a virtual thread.  This has much better
behaviour from the guest's point of view, as the actually-scheduled
topology remains consistent, but does potentially come with even lower
utilisation if every other thread in the guest is idle.

A side requirement for core-aware scheduling is for Xen to have an
accurate idea of the topology presented to the guest.  I need to dust
off my Toolstack CPUID/MSR improvement series and get that upstream.
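
As a diagnostic aid (and nothing more -- the real work is on the
Xen/toolstack side), something as small as this, run inside a guest,
dumps the SMT/core levels the guest is actually being shown via CPUID
leaf 0xb, assuming a reasonably recent gcc/clang cpuid.h:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int subleaf, eax, ebx, ecx, edx;

        for ( subleaf = 0; ; subleaf++ )
        {
            unsigned int type;

            if ( !__get_cpuid_count(0xb, subleaf, &eax, &ebx, &ecx, &edx) )
            {
                printf("CPUID leaf 0xb not available\n");
                return 1;
            }

            type = (ecx >> 8) & 0xff;           /* 1 = SMT, 2 = core */
            if ( type == 0 )
                break;                          /* end of enumeration */

            printf("level %u: type %u, %u logical CPU(s), x2APIC shift %u\n",
                   subleaf, type, ebx & 0xffff, eax & 0x1f);
        }
        return 0;
    }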

One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest