Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-10-24 Thread Tycho Andersen
On Wed, Oct 24, 2018 at 04:30:42PM +0530, Khalid Aziz wrote:
> On 10/15/2018 01:37 PM, Khalid Aziz wrote:
> > On 09/24/2018 08:45 AM, Stecklina, Julian wrote:
> > > I didn't test the version with TLB flushes, because it's clear that the
> > > overhead is so bad that no one wants to use this.
> > 
> > I don't think we can ignore the vulnerability caused by not flushing
> > stale TLB entries. On a mostly idle system, TLB entries hang around long
> > enough to make this fairly easy to exploit. I was able to use the
> > additional test in the lkdtm module added by this patch series to
> > successfully read pages unmapped from the physmap just by waiting for
> > the system to become idle. A rogue program can simply monitor system
> > load and mount its attack using the ret2dir exploit when the system is
> > mostly idle. This brings us back to the prohibitive cost of TLB flushes.
> > If we are unmapping a page from the physmap every time the page is
> > allocated to userspace, we are forced to incur the cost of TLB flushes
> > in some way. The work Tycho was doing to implement Dave's suggestion can
> > help here. Once Tycho has something working, I can measure the overhead
> > on my test machine. Tycho, I can help with your implementation if you
> > need it.
> 
> I looked at Tycho's last patch with batch update from
> . I ported it on top of Julian's
> patches and got it working well enough to gather performance numbers. Here
> is what I see for system times on a machine with dual Xeon E5-2630 and 256GB
> of memory when running "make -j30 all" on 4.18.6 kernel (percentages are
> relative to base 4.19-rc8 kernel without xpfo):
> 
> 
> Base 4.19-rc8                           913.84s
> 4.19-rc8 + xpfo, no TLB flush          1027.985s  (+12.5%)
> 4.19-rc8 + batch update, no TLB flush   970.39s   (+6.2%)
> 4.19-rc8 + xpfo, TLB flush             8458.449s  (+825.6%)
> 4.19-rc8 + batch update, TLB flush     4665.659s  (+410.6%)
> 
> Batch update is a significant improvement, but we are starting so far
> behind baseline that it is still a huge slowdown.

There's some other stuff that Dave suggested that I didn't do; in
particular coalescing xpfo bits instead of setting things once per page
when mappings are shared, etc.

Perhaps that will help more?

I'm still stuck working on something else for now, but I hope to be
able to participate more on this Soon (TM). Thanks for the testing!

Tycho


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-10-24 Thread Khalid Aziz

On 10/15/2018 01:37 PM, Khalid Aziz wrote:

On 09/24/2018 08:45 AM, Stecklina, Julian wrote:

I didn't test the version with TLB flushes, because it's clear that the
overhead is so bad that no one wants to use this.


I don't think we can ignore the vulnerability caused by not flushing
stale TLB entries. On a mostly idle system, TLB entries hang around long
enough to make this fairly easy to exploit. I was able to use the
additional test in the lkdtm module added by this patch series to
successfully read pages unmapped from the physmap just by waiting for
the system to become idle. A rogue program can simply monitor system
load and mount its attack using the ret2dir exploit when the system is
mostly idle. This brings us back to the prohibitive cost of TLB flushes.
If we are unmapping a page from the physmap every time the page is
allocated to userspace, we are forced to incur the cost of TLB flushes
in some way. The work Tycho was doing to implement Dave's suggestion can
help here. Once Tycho has something working, I can measure the overhead
on my test machine. Tycho, I can help with your implementation if you
need it.


I looked at Tycho's last patch with batch update from 
. I ported it on top of Julian's 
patches and got it working well enough to gather performance numbers. 
Here is what I see for system times on a machine with dual Xeon E5-2630 
and 256GB of memory when running "make -j30 all" on 4.18.6 kernel 
(percentages are relative to base 4.19-rc8 kernel without xpfo):



Base 4.19-rc8                           913.84s
4.19-rc8 + xpfo, no TLB flush          1027.985s  (+12.5%)
4.19-rc8 + batch update, no TLB flush   970.39s   (+6.2%)
4.19-rc8 + xpfo, TLB flush             8458.449s  (+825.6%)
4.19-rc8 + batch update, TLB flush     4665.659s  (+410.6%)

Batch update is a significant improvement, but we are starting so far
behind baseline that it is still a huge slowdown.


--
Khalid



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-10-15 Thread Khalid Aziz

On 09/24/2018 08:45 AM, Stecklina, Julian wrote:

I didn't test the version with TLB flushes, because it's clear that the
overhead is so bad that no one wants to use this.


I don't think we can ignore the vulnerability caused by not flushing
stale TLB entries. On a mostly idle system, TLB entries hang around long
enough to make this fairly easy to exploit. I was able to use the
additional test in the lkdtm module added by this patch series to
successfully read pages unmapped from the physmap just by waiting for
the system to become idle. A rogue program can simply monitor system
load and mount its attack using the ret2dir exploit when the system is
mostly idle. This brings us back to the prohibitive cost of TLB flushes.
If we are unmapping a page from the physmap every time the page is
allocated to userspace, we are forced to incur the cost of TLB flushes
in some way. The work Tycho was doing to implement Dave's suggestion can
help here. Once Tycho has something working, I can measure the overhead
on my test machine. Tycho, I can help with your implementation if you
need it.
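
The idea behind that lkdtm test is roughly the following; this is only an
illustrative sketch, not the actual lkdtm code, and the function name is
made up:

#include <linux/mm.h>
#include <linux/printk.h>

/*
 * Illustrative sketch only. The page has been handed to userspace, so
 * XPFO has cleared its direct-map (physmap) PTE. The kernel then tries
 * to read the page through its physmap alias: with the TLB properly
 * flushed this faults, but a stale TLB entry lets the read succeed,
 * which is exactly the window an attacker waits for on an idle system.
 */
static void xpfo_sketch_read_user_page(struct page *user_page)
{
        unsigned char *alias = page_address(user_page);  /* physmap alias */

        pr_info("read back 0x%02x via the direct map\n", *alias);
}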


--
Khalid


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-25 Thread Stecklina, Julian
On Sun, 2018-09-23 at 12:33 +1000, Balbir Singh wrote:
> > And in so doing, significantly reduces the amount of non-kernel data
> > vulnerable to speculative execution attacks against the kernel.
> > (and reduces what data can be loaded into the L1 data cache while
> > in kernel mode, to be peeked at by the recent L1 Terminal Fault
> > vulnerability).
> 
> I see and there is no way for gadgets to invoke this path from
> user space to make their speculation successful? We still have to
> flush L1, independent of whether XPFO is enabled or not, right?

Yes. And even with XPFO and L1 cache flushing enabled, there are more
steps that need to be taken to reliably guard against information leaks
using speculative execution.

Specifically, I'm looking into making certain allocations in the Linux
kernel process-local to hide even more memory from prefetching.

Another puzzle piece is co-scheduling support that is relevant for
systems with hyperthreading enabled: https://lwn.net/Articles/764461/

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-24 Thread Stecklina, Julian
On Tue, 2018-09-18 at 17:00 -0600, Khalid Aziz wrote:
> I tested the kernel with this new code. When booted without
> "xpfotlbflush", 
> there is no meaningful change in system time with kernel compile. 

That's good news! So the lock optimizations seem to help.

> Kernel 
> locks up during bootup when booted with xpfotlbflush:

I didn't test the version with TLB flushes, because it's clear that the
overhead is so bad that no one wants to use this.

It shouldn't lock up though, so maybe there is still a race condition
somewhere. I'll give this a spin on my end later this week.

Thanks for trying this out!

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-22 Thread Balbir Singh
On Wed, Sep 19, 2018 at 08:43:07AM -0700, Jonathan Adams wrote:
> (apologies again; resending due to formatting issues)
> On Tue, Sep 18, 2018 at 6:03 PM Balbir Singh  wrote:
> >
> > On Mon, Aug 20, 2018 at 09:52:19PM +, Woodhouse, David wrote:
> > > On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> > > >
> > > > Of course, after the long (and entirely unrelated) discussion about
> > > > the TLB flushing bug we had, I'm starting to worry about my own
> > > > competence, and maybe I'm missing something really fundamental, and
> > > > the XPFO patches do something else than what I think they do, or my
> > > > "hey, let's use our Meltdown code" idea has some fundamental weakness
> > > > that I'm missing.
> > >
> > > The interesting part is taking the user (and other) pages out of the
> > > kernel's 1:1 physmap.
> > >
> > > It's the *kernel* we don't want being able to access those pages,
> > > because of the multitude of unfixable cache load gadgets.
> >
> > I am missing why we need this since the kernel can't access
> > (SMAP) unless we go through to the copy/to/from interface
> > or execute any of the user pages. Is it because of the dependency
> > on the availability of those features?
> >
> SMAP protects against kernel accesses to non-PRIV (i.e. userspace)
> mappings, but that isn't relevant to what's being discussed here.
> 
> David is talking about the kernel Direct Map, which is a PRIV (i.e.
> kernel) mapping of all physical memory on the system, at
>   VA = (base + PA).
> Since this mapping exists for all physical addresses, speculative
> load gadgets (and the processor's prefetch mechanism, etc.) can
> load arbitrary data even if it is only otherwise mapped into user
> space.

Load arbitrary data with no permission checks (strict RWX).

> 
> XPFO fixes this by unmapping the Direct Map translations when the
> page is allocated as a user page. The mapping is only restored:
>1. temporarily if the kernel needs direct access to the page
>   (i.e. to zero it, access it from a device driver, etc),
>2. when the page is freed
> 
> And in so doing, significantly reduces the amount of non-kernel data
> vulnerable to speculative execution attacks against the kernel.
> (and reduces what data can be loaded into the L1 data cache while
> in kernel mode, to be peeked at by the recent L1 Terminal Fault
> vulnerability).

I see and there is no way for gadgets to invoke this path from
user space to make their speculation successful? We still have to
flush L1, independent of whether XPFO is enabled or not, right?

Balbir Singh.



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-19 Thread Jonathan Adams
(apologies again; resending due to formatting issues)
On Tue, Sep 18, 2018 at 6:03 PM Balbir Singh  wrote:
>
> On Mon, Aug 20, 2018 at 09:52:19PM +, Woodhouse, David wrote:
> > On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> > >
> > > Of course, after the long (and entirely unrelated) discussion about
> > > the TLB flushing bug we had, I'm starting to worry about my own
> > > competence, and maybe I'm missing something really fundamental, and
> > > the XPFO patches do something else than what I think they do, or my
> > > "hey, let's use our Meltdown code" idea has some fundamental weakness
> > > that I'm missing.
> >
> > The interesting part is taking the user (and other) pages out of the
> > kernel's 1:1 physmap.
> >
> > It's the *kernel* we don't want being able to access those pages,
> > because of the multitude of unfixable cache load gadgets.
>
> I am missing why we need this since the kernel can't access
> (SMAP) unless we go through to the copy/to/from interface
> or execute any of the user pages. Is it because of the dependency
> on the availability of those features?
>
SMAP protects against kernel accesses to non-PRIV (i.e. userspace)
mappings, but that isn't relevant to what's being discussed here.

David is talking about the kernel Direct Map, which is a PRIV (i.e.
kernel) mapping of all physical memory on the system, at
  VA = (base + PA).
Since this mapping exists for all physical addresses, speculative
load gadgets (and the processor's prefetch mechanism, etc.) can
load arbitrary data even if it is only otherwise mapped into user
space.

XPFO fixes this by unmapping the Direct Map translations when the
page is allocated as a user page. The mapping is only restored:
   1. temporarily if the kernel needs direct access to the page
  (i.e. to zero it, access it from a device driver, etc),
   2. when the page is freed

And in so doing, significantly reduces the amount of non-kernel data
vulnerable to speculative execution attacks against the kernel.
(and reduces what data can be loaded into the L1 data cache while
in kernel mode, to be peeked at by the recent L1 Terminal Fault
vulnerability).
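
A very rough sketch of that flow, just to make the mechanism concrete. The
helper set_physmap_present() and the per-page xpfo_mapcount field are
invented for illustration and do not correspond to the actual patch set:

#include <linux/mm.h>
#include <linux/atomic.h>
#include <asm/tlbflush.h>

/* Hypothetical helper: rewrites the direct-map PTE of a single page. */
static void set_physmap_present(struct page *page, bool present);

static void xpfo_page_to_user(struct page *page)   /* page allocated to userspace */
{
        unsigned long kaddr = (unsigned long)page_address(page);

        set_physmap_present(page, false);                  /* drop the 1:1 mapping */
        flush_tlb_kernel_range(kaddr, kaddr + PAGE_SIZE);  /* the expensive part */
}

static void *xpfo_kmap(struct page *page)          /* temporary kernel access */
{
        if (atomic_inc_return(&page->xpfo_mapcount) == 1)  /* made-up field */
                set_physmap_present(page, true);           /* restore the mapping */
        return page_address(page);
}

static void xpfo_kunmap(struct page *page)
{
        if (atomic_dec_and_test(&page->xpfo_mapcount))
                set_physmap_present(page, false);          /* unmap again */
}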

Does that make sense?

Cheers,
- jonathan


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-18 Thread Balbir Singh
On Mon, Aug 20, 2018 at 09:52:19PM +, Woodhouse, David wrote:
> On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> > 
> > Of course, after the long (and entirely unrelated) discussion about
> > the TLB flushing bug we had, I'm starting to worry about my own
> > competence, and maybe I'm missing something really fundamental, and
> > the XPFO patches do something else than what I think they do, or my
> > "hey, let's use our Meltdown code" idea has some fundamental weakness
> > that I'm missing.
> 
> The interesting part is taking the user (and other) pages out of the
> kernel's 1:1 physmap.
> 
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

I am missing why we need this since the kernel can't access
(SMAP) unless we go through to the copy/to/from interface
or execute any of the user pages. Is it because of the dependency
on the availability of those features?

Balbir Singh.



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-18 Thread Khalid Aziz
On 09/17/2018 03:51 AM, Julian Stecklina wrote:
> Khalid Aziz  writes:
> 
>> I ran tests with your updated code and gathered lock statistics. Change in
>> system time for "make -j60" was in the noise margin (It actually went up by
>> about 2%). There is some contention on xpfo_lock. Average wait time does not
>> look high compared to other locks. Max hold time looks a little long. From
>> /proc/lock_stat:
>>
>>&(>xpfo_lock)->rlock: 29698  29897  
>>  0.06 134.39   15345.58   0.51  422474670  
>> 960222532   0.05   30362.05   195807002.62   0.20
>>
>> Nevertheless even a smaller average wait time can add up.
> 
> Thanks for doing this!
> 
> I've spent some time optimizing spinlock usage in the code. See the two
> last commits in my xpfo-master branch[1]. The optimization in
> xpfo_kunmap is pretty safe. The last commit that optimizes locking in
> xpfo_kmap is tricky, though, and I'm not sure this is the right
> approach. FWIW, I've modeled this locking strategy in Spin and it
> doesn't find any problems with it.
> 
> I've tested the result on a box with 72 hardware threads and I didn't
> see a meaningful difference in kernel compile performance. It's still
> hovering around 2%. So the question is, whether it's actually useful to
> do these optimizations.
> 
> Khalid, you mentioned 5% overhead. Can you give the new code a spin and
> see whether anything changes?

Hi Julian,

I tested the kernel with this new code. When booted without "xpfotlbflush", 
there is no meaningful change in system time with kernel compile. Kernel 
locks up during bootup when booted with xpfotlbflush:

[   52.967060] RIP: 0010:queued_spin_lock_slowpath+0xf6/0x1e0
[   52.967061] Code: 48 03 34 c5 80 97 12 82 48 89 16 8b 42 08 85 c0 75 09 f3 
90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 <8b> 07 66 
85 c0 75 f7 41 89 c0 66 45 31 c0 41 39 c8 0f 84 93 00 00
[   52.967061] RSP: 0018:c9001cc83a00 EFLAGS: 0002
[   52.967062] RAX: 00340101 RBX: ea06c16292e8 RCX: 0058
[   52.967062] RDX: 88603c9e3980 RSI:  RDI: ea06c16292e8
[   52.967063] RBP: ea06c1629300 R08: 0001 R09: 
[   52.967063] R10:  R11: 0001 R12: 88c02765a000
[   52.967063] R13:  R14: 8860152a0d00 R15: 
[   52.967064] FS:  7f41ad1658c0() GS:88603c80() 
knlGS:
[   52.967064] CS:  0010 DS:  ES:  CR0: 80050033
[   52.967064] CR2: 88c02765a000 CR3: 0060252e4003 CR4: 007606e0
[   52.967065] DR0:  DR1:  DR2: 
[   52.967065] DR3:  DR6: fffe0ff0 DR7: 0400
[   52.967065] PKRU: 5554
[   52.967066] Call Trace:
[   52.967066]  do_raw_spin_lock+0x6d/0xa0
[   52.967066]  _raw_spin_lock+0x53/0x70
[   52.967067]  ? xpfo_do_map+0x1b/0x52
[   52.967067]  xpfo_do_map+0x1b/0x52
[   52.967067]  xpfo_spurious_fault+0xac/0xae
[   52.967068]  __do_page_fault+0x3cc/0x4e0
[   52.967068]  ? __lock_acquire.isra.31+0x165/0x710
[   52.967068]  do_page_fault+0x32/0x180
[   52.967068]  page_fault+0x1e/0x30
[   52.967069] RIP: 0010:memcpy_erms+0x6/0x10
[   52.967069] Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 
83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1  a4 c3 
0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[   52.967070] RSP: 0018:c9001cc83bb8 EFLAGS: 00010246
[   52.967070] RAX: 8860299d0f00 RBX: c9001cc83dc8 RCX: 0080
[   52.967071] RDX: 0080 RSI: 88c02765a000 RDI: 8860299d0f00
[   52.967071] RBP: 0080 R08: c9001cc83d90 R09: 0001
[   52.967071] R10:  R11: 0001 R12: 0080
[   52.967072] R13: 0080 R14:  R15: 88c02765a080
[   52.967072]  _copy_to_iter+0x3b6/0x430
[   52.967072]  copy_page_to_iter+0x1cf/0x390
[   52.967073]  ? pagecache_get_page+0x26/0x200
[   52.967073]  generic_file_read_iter+0x620/0xaf0
[   52.967073]  ? avc_has_perm+0x12e/0x200
[   52.967074]  ? avc_has_perm+0x34/0x200
[   52.967074]  ? sched_clock+0x5/0x10
[   52.967074]  __vfs_read+0x112/0x190
[   52.967074]  vfs_read+0x8c/0x140
[   52.967075]  kernel_read+0x2c/0x40
[   52.967075]  prepare_binprm+0x121/0x230
[   52.967075]  __do_execve_file.isra.32+0x56f/0x930
[   52.967076]  ? __do_execve_file.isra.32+0x140/0x930
[   52.967076]  __x64_sys_execve+0x44/0x50
[   52.967076]  do_syscall_64+0x5b/0x190
[   52.967077]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   52.967077] RIP: 0033:0x7f41abd898c7
[   52.967078] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 
41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 02 f3 c3 48 8b 15 98 05 30 00 f7 d8 64 89 02
[   52.967078] RSP: 

Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-17 Thread Christoph Hellwig
On Mon, Sep 17, 2018 at 12:01:02PM +0200, Julian Stecklina wrote:
> Juerg Haefliger  writes:
> 
> >> I've updated my XPFO branch[1] to make some of the debugging optional
> >> and also integrated the XPFO bookkeeping with struct page, instead of
> >> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> >> path.
> >
> > FWIW, that was my original design but there was some resistance to
> > adding more to the page struct and page extension was suggested
> > instead.
> 
> From looking at both versions, I have to say that having the metadata in
> struct page makes the code easier to understand and removes some special
> cases and bookkeeping.

Btw, can xpfo_lock be replaced with a bit spinlock in the page?
Growing struct page too much might cause performance issues.  Then again,
going beyond the 64 byte cache line might already cause that, and even
then it probably is still way better than the page extensions.

OTOH, if you keep the spinlock it might be worth using
atomic_dec_and_lock on the count.  Maybe the answer is a hash of
spinlocks, as we obviously can't take all that many of them at the same
time anyway.

Also, for your transitions from zero it might be worth looking at
atomic_inc_unless_zero.
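
As a rough illustration of the "hash of spinlocks" idea (array size and
names are made up, and the locks would need spin_lock_init() at boot):

#include <linux/spinlock.h>
#include <linux/hash.h>
#include <linux/mm_types.h>

#define XPFO_LOCK_BITS  8
#define XPFO_NR_LOCKS   (1 << XPFO_LOCK_BITS)

/* A fixed array of locks shared by all pages, instead of one per page. */
static spinlock_t xpfo_locks[XPFO_NR_LOCKS];

static inline spinlock_t *xpfo_lock_for(struct page *page)
{
        /* hash_ptr() spreads struct page pointers over the lock array */
        return &xpfo_locks[hash_ptr(page, XPFO_LOCK_BITS)];
}

/* usage: spin_lock(xpfo_lock_for(page)); ...; spin_unlock(xpfo_lock_for(page)); */

That would keep struct page at its current size, at the cost of some false
sharing between pages that happen to hash to the same lock.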


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-17 Thread Tycho Andersen
On Mon, Sep 17, 2018 at 12:01:02PM +0200, Julian Stecklina wrote:
> Juerg Haefliger  writes:
> 
> >> I've updated my XPFO branch[1] to make some of the debugging optional
> >> and also integrated the XPFO bookkeeping with struct page, instead of
> >> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> >> path.
> >
> > FWIW, that was my original design but there was some resistance to
> > adding more to the page struct and page extension was suggested
> > instead.
> 
> From looking at both versions, I have to say that having the metadata in
> struct page makes the code easier to understand and removes some special
> cases and bookkeeping.
> 
> > I'm wondering how much performance we're losing by having to split
> > hugepages. Any chance this can be quantified somehow? Maybe we can
> > have a pool of some sorts reserved for userpages and group allocations
> > so that we can track the XPFO state at the hugepage level instead of
> > at the 4k level to prevent/reduce page splitting. Not sure if that
> > causes issues or has any unwanted side effects though...
> 
> Optimizing the allocation/deallocation path might be worthwhile, because
> that's where most of the overhead goes. I haven't looked into how to do
> this yet. I'd appreciate if someone has pointers to code that tries to
> achieve similar functionality to get me started.
> 
> That being said, I'm wondering whether we have unrealistic expectations
> about the overhead here and whether it's worth turning this patch into
> something far more complicated. Opinions?

I think that implementing Dave Hansen's suggestions of not doing
flushes/other work on every map/unmap, but only when pages are added
to the various free lists will probably help out a lot. That's where I
got stuck last time when I was trying to do it, though :)
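
Very roughly, the shape of that could be something like the sketch below.
All names and the batch size are invented, and this is not anyone's actual
code; the point is just that one flush can cover many unmaps once the
flush is deferred:

#include <linux/percpu.h>
#include <linux/mm.h>
#include <asm/tlbflush.h>

#define XPFO_BATCH 32

struct xpfo_flush_batch {
        unsigned long addr[XPFO_BATCH];   /* kernel addresses still to flush */
        unsigned int  nr;
};
static DEFINE_PER_CPU(struct xpfo_flush_batch, xpfo_batch);

static void xpfo_flush_now(struct xpfo_flush_batch *b)
{
        unsigned int i;

        for (i = 0; i < b->nr; i++)
                flush_tlb_kernel_range(b->addr[i], b->addr[i] + PAGE_SIZE);
        b->nr = 0;
}

/* Called where xpfo_kunmap() would otherwise flush immediately. */
static void xpfo_defer_flush(struct page *page)
{
        struct xpfo_flush_batch *b = get_cpu_ptr(&xpfo_batch);

        b->addr[b->nr++] = (unsigned long)page_address(page);
        if (b->nr == XPFO_BATCH)
                xpfo_flush_now(b);
        put_cpu_ptr(&xpfo_batch);
}

/*
 * In the scheme Dave suggested, the remaining entries would be flushed
 * when pages are returned to the buddy free lists, and a page must not
 * be reused by the kernel before its deferred flush has happened.
 */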

Cheers,

Tycho


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-17 Thread Julian Stecklina
Juerg Haefliger  writes:

>> I've updated my XPFO branch[1] to make some of the debugging optional
>> and also integrated the XPFO bookkeeping with struct page, instead of
>> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
>> path.
>
> FWIW, that was my original design but there was some resistance to
> adding more to the page struct and page extension was suggested
> instead.

From looking at both versions, I have to say that having the metadata in
struct page makes the code easier to understand and removes some special
cases and bookkeeping.

> I'm wondering how much performance we're losing by having to split
> hugepages. Any chance this can be quantified somehow? Maybe we can
> have a pool of some sorts reserved for userpages and group allocations
> so that we can track the XPFO state at the hugepage level instead of
> at the 4k level to prevent/reduce page splitting. Not sure if that
> causes issues or has any unwanted side effects though...

Optimizing the allocation/deallocation path might be worthwhile, because
that's where most of the overhead goes. I haven't looked into how to do
this yet. I'd appreciate if someone has pointers to code that tries to
achieve similar functionality to get me started.

That being said, I'm wondering whether we have unrealistic expectations
about the overhead here and whether it's worth turning this patch into
something far more complicated. Opinions?

Julian
--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-17 Thread Julian Stecklina
Khalid Aziz  writes:

> I ran tests with your updated code and gathered lock statistics. Change in
> system time for "make -j60" was in the noise margin (It actually went up by
> about 2%). There is some contention on xpfo_lock. Average wait time does not
> look high compared to other locks. Max hold time looks a little long. From
> /proc/lock_stat:
>
>   &(>xpfo_lock)->rlock: 29698  29897
>0.06 134.39   15345.58   0.51  422474670  
> 960222532   0.05   30362.05   195807002.62   0.20
>
> Nevertheless even a smaller average wait time can add up.

Thanks for doing this!

I've spent some time optimizing spinlock usage in the code. See the two
last commits in my xpfo-master branch[1]. The optimization in
xpfo_kunmap is pretty safe. The last commit that optimizes locking in
xpfo_kmap is tricky, though, and I'm not sure this is the right
approach. FWIW, I've modeled this locking strategy in Spin and it
doesn't find any problems with it.

I've tested the result on a box with 72 hardware threads and I didn't
see a meaningful difference in kernel compile performance. It's still
hovering around 2%. So the question is, whether it's actually useful to
do these optimizations.

Khalid, you mentioned 5% overhead. Can you give the new code a spin and
see whether anything changes?

Julian

[1] 
http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master

--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-14 Thread Khalid Aziz
On 09/12/2018 09:37 AM, Julian Stecklina wrote:
> Julian Stecklina  writes:
> 
>> Linus Torvalds  writes:
>>
>>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina  
>>> wrote:

 I've been spending some cycles on the XPFO patch set this week. For the
 patch set as it was posted for v4.13, the performance overhead of
 compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
 completely from TLB flushing. If we can live with stale TLB entries
 allowing temporary access (which I think is reasonable), we can remove
 all TLB flushing (on x86). This reduces the overhead to 2-3% for
 kernel compile.
>>>
>>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>>
>> Well, it's at least in a range where it doesn't look hopeless.
>>
>>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>>> from a kernel is not some small unnoticeable thing.
>>
>> The overhead seems to come from the hooks that XPFO adds to
>> alloc/free_pages. These hooks add a couple of atomic operations per
>> allocated (4K) page for bookkeeping. Some of these atomic ops are only
>> for debugging and could be removed. There is also some opportunity to
>> streamline the per-page space overhead of XPFO.
> 
> I've updated my XPFO branch[1] to make some of the debugging optional
> and also integrated the XPFO bookkeeping with struct page, instead of
> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> path. These changes push the overhead down to somewhere between 1.5 and
> 2% for my quad core box in kernel compile. This is close to the
> measurement noise, so I take suggestions for a better benchmark here.
> 
> Of course, if you hit contention on the xpfo spinlock then performance
> will suffer. I guess this is what happened on Khalid's large box.
> 
> I'll try to remove the spinlocks and add fixup code to the pagefault
> handler to see whether this improves the situation on large boxes. This
> might turn out to be ugly, though.
> 

Hi Julian,

I ran tests with your updated code and gathered lock statistics. Change in 
system time for "make -j60" was in the noise margin (It actually went up by 
about 2%). There is some contention on xpfo_lock. Average wait time does not 
look high compared to other locks. Max hold time looks a little long. From 
/proc/lock_stat:

  &(>xpfo_lock)->rlock: 29698  29897  
 0.06 134.39   15345.58   0.51  422474670  
960222532   0.05   30362.05   195807002.62   0.20

Nevertheless even a smaller average wait time can add up.

--
Khalid





Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-13 Thread Juerg Haefliger
On Wed, Sep 12, 2018 at 5:37 PM, Julian Stecklina  wrote:
> Julian Stecklina  writes:
>
>> Linus Torvalds  writes:
>>
>>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina  
>>> wrote:

 I've been spending some cycles on the XPFO patch set this week. For the
 patch set as it was posted for v4.13, the performance overhead of
 compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
 completely from TLB flushing. If we can live with stale TLB entries
 allowing temporary access (which I think is reasonable), we can remove
 all TLB flushing (on x86). This reduces the overhead to 2-3% for
 kernel compile.
>>>
>>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>>
>> Well, it's at least in a range where it doesn't look hopeless.
>>
>>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>>> from a kernel is not some small unnoticeable thing.
>>
>> The overhead seems to come from the hooks that XPFO adds to
>> alloc/free_pages. These hooks add a couple of atomic operations per
>> allocated (4K) page for book keeping. Some of these atomic ops are only
>> for debugging and could be removed. There is also some opportunity to
>> streamline the per-page space overhead of XPFO.
>
> I've updated my XPFO branch[1] to make some of the debugging optional
> and also integrated the XPFO bookkeeping with struct page, instead of
> requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
> path.

FWIW, that was my original design but there was some resistance to
adding more to the page struct and page extension was suggested
instead.


> These changes push the overhead down to somewhere between 1.5 and
> 2% for my quad core box in kernel compile. This is close to the
> measurement noise, so I take suggestions for a better benchmark here.
>
> Of course, if you hit contention on the xpfo spinlock then performance
> will suffer. I guess this is what happened on Khalid's large box.
>
> I'll try to remove the spinlocks and add fixup code to the pagefault
> handler to see whether this improves the situation on large boxes. This
> might turn out to be ugly, though.

I'm wondering how much performance we're losing by having to split
hugepages. Any chance this can be quantified somehow? Maybe we can
have a pool of some sort reserved for user pages and group allocations
so that we can track the XPFO state at the hugepage level instead of
at the 4k level to prevent/reduce page splitting. Not sure if that
causes issues or has any unwanted side effects though...
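
One crude way to quantify at least the direct-map splitting (just a 
thought, I have not measured this) would be to compare the 
DirectMap4k/2M/1G counters in /proc/meminfo before and after a build, 
with and without XPFO. Something as simple as:

/* Dump the x86_64 direct-map counters: DirectMap4k / DirectMap2M / DirectMap1G. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
                perror("/proc/meminfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "DirectMap", 9))
                        fputs(line, stdout);
        fclose(f);
        return 0;
}

If DirectMap4k balloons at the expense of DirectMap2M/1G with XPFO 
enabled, that is the splitting showing up, and it costs TLB reach for all 
kernel accesses in those regions, not just for the unmapped pages.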

...Juerg


> Julian
>
> [1] 
> http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master
> --
> Amazon Development Center Germany GmbH
> Berlin - Dresden - Aachen
> main office: Krausenstr. 38, 10117 Berlin
> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> Ust-ID: DE289237879
> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
>


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-12 Thread Julian Stecklina
Julian Stecklina  writes:

> Linus Torvalds  writes:
>
>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina  wrote:
>>>
>>> I've been spending some cycles on the XPFO patch set this week. For the
>>> patch set as it was posted for v4.13, the performance overhead of
>>> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>>> completely from TLB flushing. If we can live with stale TLB entries
>>> allowing temporary access (which I think is reasonable), we can remove
>>> all TLB flushing (on x86). This reduces the overhead to 2-3% for
>>> kernel compile.
>>
>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>
> Well, it's at least in a range where it doesn't look hopeless.
>
>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>> from a kernel is not some small unnoticeable thing.
>
> The overhead seems to come from the hooks that XPFO adds to
> alloc/free_pages. These hooks add a couple of atomic operations per
> allocated (4K) page for book keeping. Some of these atomic ops are only
> for debugging and could be removed. There is also some opportunity to
> streamline the per-page space overhead of XPFO.

I've updated my XPFO branch[1] to make some of the debugging optional
and also integrated the XPFO bookkeeping with struct page, instead of
requiring CONFIG_PAGE_EXTENSION, which removes some checks in the hot
path. These changes push the overhead down to somewhere between 1.5 and
2% for my quad core box in kernel compile. This is close to the
measurement noise, so I take suggestions for a better benchmark here.

Of course, if you hit contention on the xpfo spinlock then performance
will suffer. I guess this is what happened on Khalid's large box.

I'll try to remove the spinlocks and add fixup code to the pagefault
handler to see whether this improves the situation on large boxes. This
might turn out to be ugly, though.

Julian

[1] 
http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master
--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-07 Thread Khalid Aziz

On 08/30/2018 10:00 AM, Julian Stecklina wrote:

Hey everyone,

On Mon, 20 Aug 2018 15:27 Linus Torvalds  wrote:

On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David  wrote:


It's the *kernel* we don't want being able to access those pages,
because of the multitude of unfixable cache load gadgets.


Ahh.

I guess the proof is in the pudding. Did somebody try to forward-port
that patch set and see what the performance is like?


I've been spending some cycles on the XPFO patch set this week. For the
patch set as it was posted for v4.13, the performance overhead of
compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
completely from TLB flushing. If we can live with stale TLB entries
allowing temporary access (which I think is reasonable), we can remove
all TLB flushing (on x86). This reduces the overhead to 2-3% for
kernel compile.

There were no problems in forward-porting the patch set to master.
You can find the result here, including a patch that makes the TLB flushing
configurable:
http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master

It survived some casual stress-ng runs. I can rerun the benchmarks on
this version, but I doubt there is any change.


It used to be just 500 LOC. Was that because they took horrible
shortcuts?


The patch is still fairly small. As for the horrible shortcuts, I let
others comment on that.



Looks like the performance impact can be a whole lot worse. On my test 
system with 2 Xeon Platinum 8160 (HT enabled) CPUs and 768 GB of memory, 
I am seeing a very high penalty with XPFO when building 4.18.6 kernel 
sources with "make -j60":


              No XPFO patch    XPFO (no TLB flush)    XPFO (TLB flush)
sys time      52m 54.036s      55m 47.897s            434m 8.645s

That is ~8% worse with TLB flush disabled and ~720% worse with TLB flush 
enabled. This test was with kernel sources being compiled on an ext4 
filesystem. XPFO seems to affect ext2 even more. With ext2 filesystem, 
impact was ~18.6% and ~900%.


--
Khalid




Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-04 Thread Julian Stecklina
Andi Kleen  writes:

> On Sat, Sep 01, 2018 at 02:38:43PM -0700, Linus Torvalds wrote:
>> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina  wrote:
>> >
>> > I've been spending some cycles on the XPFO patch set this week. For the
>> > patch set as it was posted for v4.13, the performance overhead of
>> > compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>> > completely from TLB flushing. If we can live with stale TLB entries
>> > allowing temporary access (which I think is reasonable), we can remove
>> > all TLB flushing (on x86). This reduces the overhead to 2-3% for
>> > kernel compile.
>> 
>> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
>> 
>> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
>> from a kernel is not some small unnoticeable thing.
>
> Also the problem is that depending on the workload everything may fit
> into the TLBs, so the temporary stale TLB entries may be around
> for a long time. Modern CPUs have very large TLBs, and good
> LRU policies. For the kernel entries with global bit set and
> which are used for something there may be no reason ever to evict.
>
> Julian, I think you would need at least some quantitative perfmon data about
> TLB replacement rates in the kernel to show that it's "reasonable"
> instead of hand waving.

That's a fair point. It definitely depends on the workload. My idle
laptop gnome GUI session still causes ~40k dtlb-load-misses per second
per core. My idle server (some shells, IRC client) still has ~8k dTLB
load misses per second per core. Compiling something pushes this to
millions of misses per second.

For comparison according to https://www.7-cpu.com/cpu/Skylake_X.html SKX
can fit 1536 entries into its L2 dTLB.
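
In case anyone wants to gather the same numbers for a specific workload, 
here is a rough sketch (my own illustration, not part of the patch set) 
that counts kernel-side dTLB load misses via perf_event_open(2). It needs 
perf_event_paranoid <= 1 or CAP_SYS_ADMIN, and it ignores the short 
window before the counter is enabled:

/*
 * Rough sketch: count kernel-side dTLB load misses for a workload,
 * e.g. ./dtlb make -j8. Event encoding as documented in the
 * perf_event_open(2) man page.
 */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
        struct perf_event_attr attr;
        long long count = 0;
        pid_t child;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
        }

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.config = PERF_COUNT_HW_CACHE_DTLB |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.inherit = 1;        /* sum over the workload's children too */
        attr.exclude_user = 1;   /* only count misses taken in the kernel */

        child = fork();
        if (child == 0) {
                execvp(argv[1], &argv[1]);
                _exit(127);
        }

        fd = perf_event_open(&attr, child, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        waitpid(child, NULL, 0);
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, &count, sizeof(count)) != sizeof(count))
                perror("read");
        printf("kernel dTLB load misses: %lld\n", count);
        return 0;
}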

> Most likely I suspect you would need a low frequency regular TLB
> flush for the global entries at least, which will increase
> the overhead again.

Given the tiny experiment above, I don't think this is necessary except
for highly special usecases. If stale TLB entries are a concern, the
better intermediate step is to do INVLPG on the core that modified the
page table.

And even with these shortcomings, XPFO severely limits the data an
attacker can leak from the kernel.

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-03 Thread Andi Kleen
On Sat, Sep 01, 2018 at 06:33:22PM -0400, Wes Turner wrote:
>Speaking of pages and slowdowns,
>is there a better place to ask this question:
>From "'Turning Tables' shared page tables vuln":
>"""
>'New "Turning Tables" Technique Bypasses All Windows Kernel Mitigations'
>
> https://www.bleepingcomputer.com/news/security/new-turning-tables-technique-bypasses-all-windows-kernel-mitigations/
>> Furthermore, since the concept of page tables is also used by Apple and
>the Linux project, macOS and Linux are, in theory, also vulnerable to this
>technique, albeit the researchers have not verified such attacks, as of
>yet.
>Slides:
>https://cdn2.hubspot.net/hubfs/487909/Turning%20(Page)%20Tables_Slides.pdf
>Naturally, I took notice and decided to forward the latest scary headline
>to this list to see if this is already being addressed?

This essentially just says that if you can change page tables you can
subvert kernels. That's always been the case, always will be, I'm sure it
has been used forever by rootkits, and I don't know why anybody would pass
it off as a "new attack".

-Andi


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-03 Thread Andi Kleen
On Sat, Sep 01, 2018 at 02:38:43PM -0700, Linus Torvalds wrote:
> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina  wrote:
> >
> > I've been spending some cycles on the XPFO patch set this week. For the
> > patch set as it was posted for v4.13, the performance overhead of
> > compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
> > completely from TLB flushing. If we can live with stale TLB entries
> > allowing temporary access (which I think is reasonable), we can remove
> > all TLB flushing (on x86). This reduces the overhead to 2-3% for
> > kernel compile.
> 
> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.
> 
> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
> from a kernel is not some small unnoticeable thing.

Also the problem is that depending on the workload everything may fit
into the TLBs, so the temporary stale TLB entries may be around
for a long time. Modern CPUs have very large TLBs, and good
LRU policies. For the kernel entries with global bit set and
which are used for something there may be no reason ever to evict.

Julian, I think you would need at least some quantitative perfmon data about
TLB replacement rates in the kernel to show that it's "reasonable"
instead of hand waving.

Most likely I suspect you would need a low frequency regular TLB
flush for the global entries at least, which will increase
the overhead again.

-Andi



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-03 Thread Julian Stecklina
Linus Torvalds  writes:

> On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina  wrote:
>>
>> I've been spending some cycles on the XPFO patch set this week. For the
>> patch set as it was posted for v4.13, the performance overhead of
>> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
>> completely from TLB flushing. If we can live with stale TLB entries
>> allowing temporary access (which I think is reasonable), we can remove
>> all TLB flushing (on x86). This reduces the overhead to 2-3% for
>> kernel compile.
>
> I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.

Well, it's at least in a range where it doesn't look hopeless.

> Kernel builds are 90% user space at least for me, so a 2-3% slowdown
> from a kernel is not some small unnoticeable thing.

The overhead seems to come from the hooks that XPFO adds to
alloc/free_pages. These hooks add a couple of atomic operations per
allocated (4K) page for book keeping. Some of these atomic ops are only
for debugging and could be removed. There is also some opportunity to
streamline the per-page space overhead of XPFO.
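
As a toy illustration of why a couple of extra atomics per 4K allocation 
are measurable at all, here is a userspace comparison (nothing to do with 
the actual XPFO code) of threads hammering one shared atomic counter 
versus padded per-thread counters; the shared case mimics contended 
bookkeeping in the allocation hot path:

/*
 * Toy illustration only: contended vs. uncontended atomic increments,
 * as a stand-in for per-page bookkeeping in the allocation hot path.
 * Build with: cc -O2 -pthread atomic-cost.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define ITERS    (16 * 1000 * 1000L)

static atomic_long shared_counter;
static struct {
        atomic_long v;
        char pad[56];    /* keep per-thread counters on their own cache lines */
} per_thread[NTHREADS];
static int use_shared;

static void *worker(void *arg)
{
        long me = (long)arg;

        for (long i = 0; i < ITERS; i++) {
                if (use_shared)
                        atomic_fetch_add(&shared_counter, 1);
                else
                        atomic_fetch_add(&per_thread[me].v, 1);
        }
        return NULL;
}

static double run(int shared)
{
        pthread_t tid[NTHREADS];
        struct timespec a, b;

        use_shared = shared;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);

        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
        printf("uncontended (per-thread) atomics: %.2fs\n", run(0));
        printf("contended   (shared)     atomics: %.2fs\n", run(1));
        return 0;
}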

I'll do some more in-depth profiling later this week.

Julian
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-09-01 Thread Linus Torvalds
On Fri, Aug 31, 2018 at 12:45 AM Julian Stecklina  wrote:
>
> I've been spending some cycles on the XPFO patch set this week. For the
> patch set as it was posted for v4.13, the performance overhead of
> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
> completely from TLB flushing. If we can live with stale TLB entries
> allowing temporary access (which I think is reasonable), we can remove
> all TLB flushing (on x86). This reduces the overhead to 2-3% for
> kernel compile.

I have to say, even 2-3% for a kernel compile sounds absolutely horrendous.

Kernel builds are 90% user space at least for me, so a 2-3% slowdown
from a kernel is not some small unnoticeable thing.

   Linus


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-31 Thread Tycho Andersen
On Thu, Aug 30, 2018 at 06:00:51PM +0200, Julian Stecklina wrote:
> Hey everyone,
> 
> On Mon, 20 Aug 2018 15:27 Linus Torvalds  
> wrote:
> > On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David  wrote:
> >>
> >> It's the *kernel* we don't want being able to access those pages,
> >> because of the multitude of unfixable cache load gadgets.
> >
> > Ahh.
> > 
> > I guess the proof is in the pudding. Did somebody try to forward-port
> > that patch set and see what the performance is like?
> 
> I've been spending some cycles on the XPFO patch set this week. For the
> patch set as it was posted for v4.13, the performance overhead of
> compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
> completely from TLB flushing. If we can live with stale TLB entries
> allowing temporary access (which I think is reasonable), we can remove
> all TLB flushing (on x86). This reduces the overhead to 2-3% for
> kernel compile.

Cool, thanks for doing this! Do you have any thoughts about what the
2-3% is? It seems to me like if we're not doing the TLB flushes, the
rest of this should be *really* cheap, even cheaper than 2-3%. Dave
Hansen had suggested coalescing things on a per mapping basis vs.
doing it per page, which might help?

> > It used to be just 500 LOC. Was that because they took horrible
> > shortcuts?
> 
> The patch is still fairly small. As for the horrible shortcuts, I let
> others comment on that.

Heh, things like xpfo_temp_map() aren't awesome, but that can
hopefully be fixed by keeping a little bit of memory around for use
where we are mapping things and can't fail. I remember some discussion
about hopefully not having to sprinkle xpfo mapping calls everywhere
in the kernel, so perhaps we could get rid of it entirely?
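
Something along these lines, purely as a userspace sketch of the idea 
(the in-kernel version would need locking and would probably end up 
looking like a mempool): take from a small preallocated reserve first and 
fall back to the regular allocator only when the reserve is empty.

/*
 * Userspace sketch of the reserve idea, illustration only (not the xpfo
 * code): a small preallocated pool backs allocations that must not fail,
 * with malloc() only as a last resort.
 */
#include <stdlib.h>

#define POOL_SLOTS 8
#define SLOT_SIZE  4096

static char pool[POOL_SLOTS][SLOT_SIZE];
static unsigned char pool_used[POOL_SLOTS];

void *reserve_alloc(void)
{
        for (int i = 0; i < POOL_SLOTS; i++) {
                if (!pool_used[i]) {
                        pool_used[i] = 1;
                        return pool[i];
                }
        }
        return malloc(SLOT_SIZE);       /* reserve exhausted */
}

void reserve_free(void *p)
{
        char *c = p;

        if (c >= (char *)pool && c < (char *)pool + sizeof(pool)) {
                pool_used[(c - (char *)pool) / SLOT_SIZE] = 0;
                return;
        }
        free(p);
}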

Anyway, I'm working on some other stuff for the kernel right now, but
I hope (:D) that it should be close to done, and I'll have more cycles
to work on this soon too.

Tycho


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-31 Thread James Bottomley
On Mon, 2018-08-20 at 21:52 +, Woodhouse, David wrote:
> On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
> > 
> > Of course, after the long (and entirely unrelated) discussion about
> > the TLB flushing bug we had, I'm starting to worry about my own
> > competence, and maybe I'm missing something really fundamental, and
> > the XPFO patches do something else than what I think they do, or my
> > "hey, let's use our Meltdown code" idea has some fundamental
> > weakness
> > that I'm missing.
> 
> The interesting part is taking the user (and other) pages out of the
> kernel's 1:1 physmap.
> 
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

A long time ago, I gave a talk about precisely this at OLS (2005 I
think).  On PA-RISC we have a problem with inequivalent aliasing in the
 page cache (same physical page with two different virtual addresses
modulo 4MB) which causes a machine check if it occurs. 
Architecturally, PA can move into the cache any page for which it has a
mapping and the kernel offset map of every page causes an inequivalency
if the same page is in use in user space.  Of course, practically the
caching machinery is too busy moving in and out pages we reference to
have an interest in speculating on other pages it has a mapping for, so
it almost never does (the "almost" being a set of machine checks we see very
occasionally in the latest and most aggressively cached and speculating
CPUs).  If this were implemented, we'd be interested in using it.

James



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-31 Thread Julian Stecklina
Hey everyone,

On Mon, 20 Aug 2018 15:27 Linus Torvalds  wrote:
> On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David  wrote:
>>
>> It's the *kernel* we don't want being able to access those pages,
>> because of the multitude of unfixable cache load gadgets.
>
> Ahh.
> 
> I guess the proof is in the pudding. Did somebody try to forward-port
> that patch set and see what the performance is like?

I've been spending some cycles on the XPFO patch set this week. For the
patch set as it was posted for v4.13, the performance overhead of
compiling a Linux kernel is ~40% on x86_64[1]. The overhead comes almost
completely from TLB flushing. If we can live with stale TLB entries
allowing temporary access (which I think is reasonable), we can remove
all TLB flushing (on x86). This reduces the overhead to 2-3% for
kernel compile.

There were no problems in forward-porting the patch set to master.
You can find the result here, including a patch that makes the TLB flushing
configurable:
http://git.infradead.org/users/jsteckli/linux-xpfo.git/shortlog/refs/heads/xpfo-master

It survived some casual stress-ng runs. I can rerun the benchmarks on
this version, but I doubt there is any change.

> It used to be just 500 LOC. Was that because they took horrible
> shortcuts?

The patch is still fairly small. As for the horrible shortcuts, I let
others comment on that.

HTH,
Julian

[1] Measured on my quad-core (8 hyperthreads) Kaby Lake desktop building
Linux 4.18 with the Phoronix Test Suite.

--
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-21 Thread Liran Alon



> On 21 Aug 2018, at 17:22, David Woodhouse  wrote:
> 
> On Tue, 2018-08-21 at 17:01 +0300, Liran Alon wrote:
>> 
>>> On 21 Aug 2018, at 12:57, David Woodhouse 
>> wrote:
>>>  
>>> Another alternative... I'm told POWER8 does an interesting thing
>> with
>>> hyperthreading and gang scheduling for KVM. The host kernel doesn't
>>> actually *see* the hyperthreads at all, and KVM just launches the
>> full
>>> set of siblings when it enters a guest, and gathers them again when
>> any
>>> of them exits. That's definitely worth investigating as an option
>> for
>>> x86, too.
>> 
>> I actually think that such scheduling mechanism which prevents
>> leaking cache entries to sibling hyperthreads should co-exist
>> together with the KVM address space isolation to fully mitigate L1TF
>> and other similar vulnerabilities. The address space isolation should
>> prevent VMExit handlers code gadgets from loading arbitrary host
>> memory to the cache. Once VMExit code path switches to full host
>> address space, then we should also make sure that no other sibling
>> hyperthread is running in the guest.
> 
> The KVM POWER8 solution (see arch/powerpc/kvm/book3s_hv.c) does that.
> The siblings are *never* running host kernel code; they're all torn
> down when any of them exits the guest. And it's always the *same*
> guest.
> 

I wasn’t aware of this KVM Power8 mechanism. Thanks for the pointer.
(371fefd6f2dc ("KVM: PPC: Allow book3s_hv guests to use SMT processor modes”))

Note though that my point regarding the co-existence of the isolated address 
space together with such scheduling mechanism is still valid.
The scheduling mechanism should not be seen as an alternative to the isolated 
address space if we wish to reduce the frequency of events
in which we need to kick sibling hyperthreads from guest.

>> Focusing on the scheduling mechanism, we must make sure that when a
>> logical processor runs guest code, all sibling logical processors
>> must run code which does not populate L1D cache with information
>> unrelated to this VM. This includes forbidding one logical processor
>> to run guest code while sibling is running a host task such as a NIC
>> interrupt handler.
>> Thus, when a vCPU thread exits the guest into the host and VMExit
>> handler reaches code flow which could populate L1D cache with this
>> information, we should force an exit from the guest of the siblings
>> logical processors, such that they will be allowed to resume only on
>> a core which we can promise that the L1D cache is free from
>> information unrelated to this VM.
>> 
>> At first, I have created a patch series which attempts to implement
>> such mechanism in KVM. However, it became clear to me that this may
>> need to be implemented in the scheduler itself. This is because:
>> 1. It is difficult to handle all new scheduling constraints only in
>> KVM.
>> 2. This mechanism should be relevant for any Type-2 hypervisor which
>> runs inside Linux besides KVM (Such as VMware Workstation or
>> VirtualBox).
>> 3. This mechanism could also be used to prevent future “core-cache-
>> leaking” vulnerabilities to be exploited between processes of
>> different security domains which run as siblings on the same core.
> 
> I'm not sure I agree. If KVM is handling "only let siblings run the
> *same* guest" and the siblings aren't visible to the host at all,
> that's quite simple. Any other hypervisor can also do it.
> 
> Now, the down-side of this is that the siblings aren't visible to the
> host. They can't be used to run multiple threads of the same userspace
> processes; only multiple threads of the same KVM guest. A truly generic
> core scheduler would cope with userspace threads too.
> 
> BUT I strongly suspect there's a huge correlation between the set of
> people who care enough about the KVM/L1TF issue to enable a costly
> XPFO-like solution, and the set of people who mostly don't give a shit
> about having sibling CPUs available to run the host's userspace anyway.
> 
> This is not the "I happen to run a Windows VM on my Linux desktop" use
> case...

If I understand your proposal correctly, you suggest doing something similar to 
the KVM Power8 solution:
1. Disable HyperThreading for use by host user space.
2. Use sibling hyperthreads only in KVM and schedule group of vCPUs that run on 
a single core as a “gang” to enter and exit guest together.

This solution may work well for KVM-based cloud providers that match the 
following criteria:
1. All compute instances run with SR-IOV and IOMMU Posted-Interrupts.
2. Configure affinity such that host dedicate distinct set of physical cores 
per guest. No physical core is able to run vCPUs from multiple guests.

However, this may not necessarily be the case: some cloud providers have 
compute instances in which all devices are emulated or paravirtualized.
In the proposed scheduling mechanism, all the IOThreads of these guests will 
not be able to utilize HyperThreading which can be a significant performance 


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-21 Thread David Woodhouse
On Tue, 2018-08-21 at 17:01 +0300, Liran Alon wrote:
> 
> > On 21 Aug 2018, at 12:57, David Woodhouse 
> wrote:
> > 
> > Another alternative... I'm told POWER8 does an interesting thing
> with
> > hyperthreading and gang scheduling for KVM. The host kernel doesn't
> > actually *see* the hyperthreads at all, and KVM just launches the
> full
> > set of siblings when it enters a guest, and gathers them again when
> any
> > of them exits. That's definitely worth investigating as an option
> for
> > x86, too.
> 
> I actually think that such scheduling mechanism which prevents
> leaking cache entries to sibling hyperthreads should co-exist
> together with the KVM address space isolation to fully mitigate L1TF
> and other similar vulnerabilities. The address space isolation should
> prevent VMExit handlers code gadgets from loading arbitrary host
> memory to the cache. Once VMExit code path switches to full host
> address space, then we should also make sure that no other sibling
>> hyperthread is running in the guest.

The KVM POWER8 solution (see arch/powerpc/kvm/book3s_hv.c) does that.
The siblings are *never* running host kernel code; they're all torn
down when any of them exits the guest. And it's always the *same*
guest.

> Focusing on the scheduling mechanism, we must make sure that when a
> logical processor runs guest code, all sibling logical processors
> must run code which does not populate L1D cache with information
> unrelated to this VM. This includes forbidding one logical processor
> to run guest code while sibling is running a host task such as a NIC
> interrupt handler.
> Thus, when a vCPU thread exits the guest into the host and VMExit
> handler reaches code flow which could populate L1D cache with this
> information, we should force an exit from the guest of the siblings
> logical processors, such that they will be allowed to resume only on
> a core which we can promise that the L1D cache is free from
> information unrelated to this VM.
> 
> At first, I have created a patch series which attempts to implement
> such mechanism in KVM. However, it became clear to me that this may
> need to be implemented in the scheduler itself. This is because:
> 1. It is difficult to handle all new scheduling constraints only in
> KVM.
> 2. This mechanism should be relevant for any Type-2 hypervisor which
> runs inside Linux besides KVM (Such as VMware Workstation or
> VirtualBox).
> 3. This mechanism could also be used to prevent future “core-cache-
> leaking” vulnerabilities to be exploited between processes of
> different security domains which run as siblings on the same core.

I'm not sure I agree. If KVM is handling "only let siblings run the
*same* guest" and the siblings aren't visible to the host at all,
that's quite simple. Any other hypervisor can also do it.

Now, the down-side of this is that the siblings aren't visible to the
host. They can't be used to run multiple threads of the same userspace
processes; only multiple threads of the same KVM guest. A truly generic
core scheduler would cope with userspace threads too.

BUT I strongly suspect there's a huge correlation between the set of
people who care enough about the KVM/L1TF issue to enable a costly
XPFO-like solution, and the set of people who mostly don't give a shit
about having sibling CPUs available to run the host's userspace anyway.

This is not the "I happen to run a Windows VM on my Linux desktop" use
case...



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-21 Thread Liran Alon


> On 21 Aug 2018, at 12:57, David Woodhouse  wrote:
> 
> Another alternative... I'm told POWER8 does an interesting thing with
> hyperthreading and gang scheduling for KVM. The host kernel doesn't
> actually *see* the hyperthreads at all, and KVM just launches the full
> set of siblings when it enters a guest, and gathers them again when any
> of them exits. That's definitely worth investigating as an option for
> x86, too.

I actually think that such a scheduling mechanism which prevents leaking cache 
entries to sibling hyperthreads should co-exist with the KVM address space 
isolation to fully mitigate L1TF and other similar vulnerabilities. The address 
space isolation should prevent VMExit handler code gadgets from loading 
arbitrary host memory into the cache. Once the VMExit code path switches to the 
full host address space, we should also make sure that no other sibling 
hyperthread is running in the guest.

Focusing on the scheduling mechanism, we must make sure that when a logical 
processor runs guest code, all sibling logical processors run code which does 
not populate the L1D cache with information unrelated to this VM. This includes 
forbidding one logical processor from running guest code while a sibling is 
running a host task such as a NIC interrupt handler.
Thus, when a vCPU thread exits the guest into the host and the VMExit handler 
reaches a code flow which could populate the L1D cache with such information, 
we should force the sibling logical processors to exit the guest, such that 
they are allowed to resume only on a core whose L1D cache we can promise is 
free of information unrelated to this VM.

At first, I created a patch series which attempts to implement such a 
mechanism in KVM. However, it became clear to me that this may need to be 
implemented in the scheduler itself. This is because:
1. It is difficult to handle all the new scheduling constraints only in KVM.
2. This mechanism should be relevant for any Type-2 hypervisor which runs 
inside Linux besides KVM (such as VMware Workstation or VirtualBox).
3. This mechanism could also be used to prevent future “core-cache-leaking” 
vulnerabilities from being exploited between processes of different security 
domains which run as siblings on the same core.

The main idea is a mechanism which is very similar to Microsoft's "core 
scheduler", which they implemented to mitigate this vulnerability. The 
mechanism should work as follows:
1. Each CPU core will now be tagged with a "security domain id".
2. The scheduler will provide a mechanism to tag a task with a security 
domain id.
3. Tasks will inherit their security domain id from their parent task.
3.1. The first task in the system will have a security domain id of 0. 
Thus, if nothing special is done, all tasks will be assigned a security 
domain id of 0.
4. Tasks will be able to allocate a new security domain id from the 
scheduler and assign it to another task dynamically.
5. The Linux scheduler will prevent scheduling tasks on a core with a 
different security domain id:
5.0. A CPU core's security domain id will be set to the security domain 
id of the tasks which currently run on it.
5.1. The scheduler will attempt to first schedule a task on a core with 
the required security domain id, if such a core exists.
5.2. Otherwise, it will need to decide whether to kick all tasks running 
on some core in order to run the task with a different security domain 
id on that core.

The above mechanism can be used to mitigate the L1TF HT variant by assigning 
vCPU tasks a security domain id which is unique per VM and different from the 
security domain id of the host, which is 0.
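
To make the idea concrete, here is a minimal C sketch of the inheritance in 
step 3 and the core check in step 5. Every name here is an assumption made 
purely for illustration; nothing like this exists in the scheduler today.

#include <stdbool.h>

/*
 * Hypothetical per-task and per-core security-domain state (steps 1-4).
 * A real implementation would hang this off task_struct and the runqueue.
 */
struct task_sec {
	int domain_id;		/* 0 == host; >0 == allocated per VM */
};

struct core_sec {
	int domain_id;		/* domain currently owning this core (step 5.0) */
	int nr_running;		/* tasks running on the core's siblings */
};

/* Step 3: children start out in their parent's domain. */
static void inherit_domain(struct task_sec *child, const struct task_sec *parent)
{
	child->domain_id = parent->domain_id;
}

/*
 * Step 5: may @task run on @core without mixing security domains across
 * sibling hyperthreads? A real implementation would also handle the
 * "kick everything off the core and re-tag it" case (step 5.2).
 */
static bool core_allows_task(const struct core_sec *core, const struct task_sec *task)
{
	if (core->nr_running == 0)
		return true;			/* idle core: re-tag freely */
	return core->domain_id == task->domain_id;
}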

I would be glad to hear feedback on the above suggestion.
If this should better be discussed on a separate email thread, please say so 
and I will open a new thread.

Thanks,
-Liran




Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-21 Thread David Woodhouse
On Mon, 2018-08-20 at 15:27 -0700, Linus Torvalds wrote:
> On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David  wrote:
> >
> > It's the *kernel* we don't want being able to access those pages,
> > because of the multitude of unfixable cache load gadgets.
> 
> Ahh.
> 
> I guess the proof is in the pudding. Did somebody try to forward-port
> that patch set and see what the performance is like?

I hadn't actually seen the XPFO patch set before; we're going to take a
serious look.

Of course, this is only really something that a select few people (with
quite a lot of machines) would turn on. And they might be willing to
tolerate a significant performance cost if the alternative way to be
safe is to disable hyperthreading entirely — which is Intel's best
recommendation so far, it seems.

Another alternative... I'm told POWER8 does an interesting thing with
hyperthreading and gang scheduling for KVM. The host kernel doesn't
actually *see* the hyperthreads at all, and KVM just launches the full
set of siblings when it enters a guest, and gathers them again when any
of them exits. That's definitely worth investigating as an option for
x86, too.
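
For anyone not familiar with the POWER8 scheme, here is a conceptual sketch 
of the gang entry/exit described above. The real logic lives in 
arch/powerpc/kvm/book3s_hv.c; the helper names and the SMT width below are 
invented for illustration only.

#define SIBLINGS_PER_CORE 4	/* assumption: 4-way SMT, as on POWER8 */

/* Stubs standing in for the real secondary-thread control machinery. */
static void start_sibling_in_guest(int sibling) { (void)sibling; }
static void wait_for_sibling_to_exit(int sibling) { (void)sibling; }

static void gang_run_guest(void)
{
	int i;

	/* The host never schedules on the siblings; KVM owns all of them. */
	for (i = 1; i < SIBLINGS_PER_CORE; i++)
		start_sibling_in_guest(i);

	/* ... the primary thread enters the same guest here ... */

	/* On any exit, gather every sibling back before host code runs. */
	for (i = 1; i < SIBLINGS_PER_CORE; i++)
		wait_for_sibling_to_exit(i);
}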




Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Linus Torvalds
On Mon, Aug 20, 2018 at 4:27 PM Dave Hansen  wrote:
>
> You're right that we could have a full physmap that we switch to for
> kmap()-like access to user pages.  But, the real problem is
> transitioning pages from kernel to user usage since it requires shooting
> down the old kernel mappings for those pages in some way.

You might decide that you simply don't care enough, and are willing to
leave possible stale TLB entries rather than shoot things down.

Then you'd still possibly see user pages in the kernel map, but only
for a fairly limited time, and only until the TLB entry gets re-used
for other reasons.

Even with kernel page table entries being marked global, their
lifetime in the TLB is likely not very long, and definitely not long
enough for some user that tries to scan for pages.

 Linus


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Dave Hansen
On 08/20/2018 04:14 PM, David Woodhouse wrote:
> If you need the physmap, then rather than manually mapping with 4KiB
> pages, you just switch. Having first ensured that no malicious guest or
> userspace is running on a sibling, of course.

The problem is determining when "you need the physmap".  Tycho's
patches, as I remember them, basically classify pages into "user"
pages, which are accessed only via kmap() and friends, and "kernel"
pages, which need to be mapped all the time because they might hold a
'task_struct', a page table, or a 'struct file'.
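
As a rough illustration of that split, loosely modelled on the XPFO series 
(the names and the flat bookkeeping here are assumptions, not the actual 
patch interface):

#include <stdbool.h>

enum xpfo_class { XPFO_KERNEL_PAGE, XPFO_USER_PAGE };

struct xpfo_page {
	enum xpfo_class class;
	int kmap_count;		/* live kmap() users; real code needs locking */
};

/* Stub: the real code would rewrite the kernel's 1:1 physmap PTE. */
static void set_physmap_present(struct xpfo_page *page, bool present)
{
	(void)page;
	(void)present;
}

/* "User" pages leave the physmap when handed to userspace (TLB shootdown!). */
static void xpfo_alloc_user_page(struct xpfo_page *page)
{
	page->class = XPFO_USER_PAGE;
	page->kmap_count = 0;
	set_physmap_present(page, false);
}

/* ...and are only mapped back for the duration of a kmap(). */
static void xpfo_kmap(struct xpfo_page *page)
{
	if (page->class == XPFO_USER_PAGE && page->kmap_count++ == 0)
		set_physmap_present(page, true);
}

static void xpfo_kunmap(struct xpfo_page *page)
{
	if (page->class == XPFO_USER_PAGE && --page->kmap_count == 0)
		set_physmap_present(page, false);	/* flush again */
}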

You're right that we could have a full physmap that we switch to for
kmap()-like access to user pages.  But, the real problem is
transitioning pages from kernel to user usage since it requires shooting
down the old kernel mappings for those pages in some way.


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread David Woodhouse


On Mon, 2018-08-20 at 15:59 -0700, Dave Hansen wrote:
> On 08/20/2018 03:35 PM, Tycho Andersen wrote:
> > Since Meltdown hit, I haven't worked seriously on understanding and
> > implementing his suggestions, in part because it wasn't clear to me
> > what pieces of the infrastructure we might be able to re-use. Someone
> > who knows more about mm/ might be able to suggest an approach, though
> 
> Unfortunately, I'm not sure there's much of KPTI we can reuse.  KPTI
> still has a very static kernel map (well, two static kernel maps) and
> XPFO really needs a much more dynamic map.
> 
> We do have a bit of infrastructure now to do TLB flushes near the kernel
> exit point, but it's entirely for the user address space, which isn't
> affected by XPFO.

One option is to have separate kernel address spaces, both with and
without the full physmap.

If you need the physmap, then rather than manually mapping with 4KiB
pages, you just switch. Having first ensured that no malicious guest or
userspace is running on a sibling, of course.

I'm not sure it's a win, but it might be worth looking at.
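
Roughly, the idea would look like this; one set of kernel page tables without 
the full physmap (the default) and one with it, switched at kmap() time 
instead of mapping individual 4KiB pages. Every name and constant below is 
made up purely to illustrate the shape of it.

#define DIRECT_MAP_BASE 0xffff888000000000UL	/* assumed physmap base */

static unsigned long pgd_no_physmap;	/* default kernel tables */
static unsigned long pgd_full_physmap;	/* tables with the 1:1 map present */

static void load_pgd(unsigned long pgd) { (void)pgd; /* i.e. write CR3 */ }

static void *kmap_by_switching(unsigned long phys)
{
	/*
	 * Caller must first ensure no malicious guest or userspace is
	 * running on a sibling, as noted above.
	 */
	load_pgd(pgd_full_physmap);
	return (void *)(DIRECT_MAP_BASE + phys);
}

static void kunmap_by_switching(void)
{
	load_pgd(pgd_no_physmap);	/* hide the physmap again */
}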



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Dave Hansen
On 08/20/2018 03:35 PM, Tycho Andersen wrote:
> Since Meltdown hit, I haven't worked seriously on understanding and
> implementing his suggestions, in part because it wasn't clear to me
> what pieces of the infrastructure we might be able to re-use. Someone
> who knows more about mm/ might be able to suggest an approach, though

Unfortunately, I'm not sure there's much of KPTI we can reuse.  KPTI
still has a very static kernel map (well, two static kernel maps) and
XPFO really needs a much more dynamic map.

We do have a bit of infrastructure now to do TLB flushes near the kernel
exit point, but it's entirely for the user address space, which isn't
affected by XPFO.



Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Tycho Andersen
On Mon, Aug 20, 2018 at 03:27:52PM -0700, Linus Torvalds wrote:
> On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David  wrote:
> >
> > It's the *kernel* we don't want being able to access those pages,
> > because of the multitude of unfixable cache load gadgets.
> 
> Ahh.
> 
> I guess the proof is in the pudding. Did somebody try to forward-port
> that patch set and see what the performance is like?
> 
> It used to be just 500 LOC. Was that because they took horrible
> shortcuts? Are the performance numbers for the 32-bit case that
> already had the kmap() overhead?

The last version I worked on was a bit before Meltdown was public:
https://lkml.org/lkml/2017/9/7/445

The overhead was a lot, but Dave Hansen gave some ideas about how to
speed things up in this thread: https://lkml.org/lkml/2017/9/20/828

Since Meltdown hit, I haven't worked seriously on understanding and
implementing his suggestions, in part because it wasn't clear to me
what pieces of the infrastructure we might be able to re-use. Someone
who knows more about mm/ might be able to suggest an approach, though.

Tycho


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Linus Torvalds
On Mon, Aug 20, 2018 at 3:02 PM Woodhouse, David  wrote:
>
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

Ahh.

I guess the proof is in the pudding. Did somebody try to forward-port
that patch set and see what the performance is like?

It used to be just 500 LOC. Was that because they took horrible
shortcuts? Are the performance numbers for the 32-bit case that
already had the kmap() overhead?

  Linus


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Kees Cook
On Mon, Aug 20, 2018 at 2:52 PM, Woodhouse, David  wrote:
> On Mon, 2018-08-20 at 14:48 -0700, Linus Torvalds wrote:
>>
>> Of course, after the long (and entirely unrelated) discussion about
>> the TLB flushing bug we had, I'm starting to worry about my own
>> competence, and maybe I'm missing something really fundamental, and
>> the XPFO patches do something else than what I think they do, or my
>> "hey, let's use our Meltdown code" idea has some fundamental weakness
>> that I'm missing.
>
> The interesting part is taking the user (and other) pages out of the
> kernel's 1:1 physmap.
>
> It's the *kernel* we don't want being able to access those pages,
> because of the multitude of unfixable cache load gadgets.

Right. And even before Meltdown, it was desirable to remove those from
the physmap to avoid SMAP (and in some cases SMEP) bypasses (as
detailed in the mentioned paper:
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf).

-Kees

-- 
Kees Cook
Pixel Security


Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Linus Torvalds
On Mon, Aug 20, 2018 at 2:26 PM Konrad Rzeszutek Wilk
 wrote:
>
> See eXclusive Page Frame Ownership (https://lwn.net/Articles/700606/) which 
> was posted
> way back in 2016.

Ok, so my gut feel is that the above was reasonable within the context
of 2016, but that the XPFO model is completely pointless and wrong in
the post-Meltdown world we now live in.

Why?

Because with the Meltdown patches, we ALREADY HAVE the isolated page
tables that XPFO tries to do.

They are just the normal user page tables.

So don't go around doing other crazy things.

All you need to do is to literally:

 - before you enter VMX mode, switch to the user page tables

 - when you exit, switch back to the kernel page tables

don't do anything else.  You're done.

Now, this is complicated a bit by the fact that in order to enter VMX
mode with the user page tables, you do need to add the VMX state
itself to those user page tables (and add the actual trampoline code
to the vmenter too).

So it does imply we need to slightly extend the user mapping with a
few new patches, but that doesn't sound bad.

In fact, it sounds absolutely trivial to me.
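
A conceptual sketch of those two steps, with stand-ins for the real PTI 
CR3-switching machinery and the vmenter path (nothing here is the actual KVM 
code; the names are illustrative only):

static unsigned long kernel_cr3, user_cr3;	/* per-mm PTI page-table roots */

static void write_cr3_reg(unsigned long cr3) { (void)cr3; /* mov to %cr3 */ }
static void do_vmenter(void) { /* vmlaunch/vmresume trampoline, mapped in the user tables */ }

static void enter_guest_on_user_tables(void)
{
	write_cr3_reg(user_cr3);	/* guest speculation sees only user mappings */
	do_vmenter();			/* returns on VM exit */
	write_cr3_reg(kernel_cr3);	/* restore the full kernel mapping */
}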

The other thing you want to do is the trivial optimization of "hey,
we exited VMX mode due to a host interrupt", which would look like
this:

 * switch to user page tables in order to do vmenter
 * vmenter
 * host interrupt happens
   - switch to kernel page tables to handle irq
   - do_IRQ etc
   - switch back to user page tables
   - iret
 * switch to kernel page tables because the vmenter returned

so you want to have some trivial short-circuiting of that last "switch
to user page tables and back" dance. It may actually be that we don't
even need it, because the irq code may just be looking at what *mode*
we were in, not what page tables we were in. I looked at that code
back in the meltdown days, but that's already so last-year now that we
have all these _other_ CPU bugs we handled.

But other than small details like that, doesn't this "use our Meltdown
user page table" sound like the right thing to do?

And note: no new VM code or complexity. None. We already have the
"isolated KVM context with only pages for the KVM process" case
handled.

Of course, after the long (and entirely unrelated) discussion about
the TLB flushing bug we had, I'm starting to worry about my own
competence, and maybe I'm missing something really fundamental, and
the XPFO patches do something else than what I think they do, or my
"hey, let's use our Meltdown code" idea has some fundamental weakness
that I'm missing.

  Linus


Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-20 Thread Konrad Rzeszutek Wilk
Hi!

See eXclusive Page Frame Ownership (https://lwn.net/Articles/700606/), which 
was posted way back in 2016.

In the last couple of months there has been a slew of CPU issues that have 
complicated a lot of things. The latest - L1TF - is still fresh in folks' 
minds, and it is especially acute for virtualization workloads.

As such, a bunch of folks from different cloud companies (CCed) are looking 
at ways to make the Linux kernel more resistant to hardware having these 
sorts of bugs.

In particular we are looking at a way to "remove as many mappings from the 
global kernel address space as possible. Specifically, while being in the 
context of process A, memory of process B should not be visible in the 
kernel." (email from Julian Stecklina). That is the high-level view; as for 
how this could get done, well, that is why I am posting this on 
LKML/linux-hardening/kvm-devel/linux-mm, to start the discussion.

Usually I would start with a draft of RFC patches so folks can rip it apart, but
thanks to other people (Juerg thank you!) it already exists:

(see https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1222756.html)

The idea would be to extend this to:

 1) Only do it for processes that run on CPUs that are in the isolcpus list 
(a rough sketch of this check follows after point 2).

 2) Expand this to per-CPU page tables. That is, each CPU has its own unique 
set of page tables - naturally _START_KERNEL -> __end would be mapped, but 
the rest would not.
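
For point 1, something along these lines; the mask, the helpers and 
xpfo_unmap_from_physmap() are illustrative assumptions only (in 4.18-era 
kernels the isolated set actually lives in the isolation/housekeeping code):

#include <stdbool.h>

#define MAX_CPUS 64

static unsigned long isolcpus_mask;	/* one bit per CPU, built from isolcpus= */

static bool cpu_is_isolated(int cpu)
{
	return cpu < MAX_CPUS && (isolcpus_mask & (1UL << cpu));
}

/* Stub: the real work is removing the page from the 1:1 map plus a TLB flush. */
static void xpfo_unmap_from_physmap(void *page) { (void)page; }

static void maybe_unmap_user_page(void *page, int cpu)
{
	/* Housekeeping CPUs skip the expensive unmap and shootdown. */
	if (!cpu_is_isolated(cpu))
		return;
	xpfo_unmap_from_physmap(page);
}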

Thoughts? Is this possible? Crazy? Better ideas?

