Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-19 Thread Thomas Gleixner
On Tue, Nov 17 2020 at 09:19, Alexandre Chartre wrote:
> On 11/16/20 9:24 PM, Borislav Petkov wrote:
>> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> So PTI was added exactly to *not* have kernel memory mapped in the user
>> page table. You're partially reversing that...
>
> We are not reversing PTI, we are extending it.

You widen the exposure surface without providing an argument why it is safe.

> PTI removes all kernel mapping from the user page-table. However there's
> no issue with mapping some kernel data into the user page-table as long as
> these data have no sensitive information.

Define sensitive information. 

> Actually, PTI is already doing that but with a very limited scope. PTI adds
> into the user page-table some kernel mappings which are needed for userland
> to enter the kernel (such as the kernel entry text, the ESPFIX, the
> CPU_ENTRY_AREA_BASE...).
>
> So here, we are extending the PTI mapping so that we can execute more kernel
> code while using the user page-table; it's a kind of PTI on steroids.

Let's just look at a syscall:

noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
{
        long ret;

        enter_from_user_mode(regs);
                lockdep_hardirqs_off();
                user_exit_irqoff();
                trace_hardirqs_off_finish();

So just looking at the 3 calls above, how are you going to guarantee
that everything these callchains touch is mapped into user space?

Not to talk about everything which comes after that.

Thanks,

tglx




Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-18 Thread Alexandre Chartre



On 11/18/20 12:29 PM, Borislav Petkov wrote:
> On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote:
>> Well, it looks like I wrongly assumed that KPTI was a well known performance
>> overhead since it was introduced (because it adds extra page-table switches),
>> but you are right I should be presenting my own numbers.
>
> Here's one recipe, courtesy of Mel:
>
> https://github.com/gormanm/mmtests

Thanks for the detailed information, I have run the test and I see the same
difference as with the tools/perf and libMICRO I already sent: there's a 150%
difference for getpid() with and without pti.

alex.

-

# ../../compare-kernels.sh --baseline test-nopti --compare test-pti

poundsyscall
                    test                test
                   nopti                 pti
Min         2     1.99 (   0.00%)     5.08 (-155.28%)
Min         4     1.02 (   0.00%)     2.60 (-154.90%)
Min         6     0.94 (   0.00%)     2.07 (-120.21%)
Min         8     0.81 (   0.00%)     1.60 ( -97.53%)
Min        12     0.85 (   0.00%)     1.65 ( -94.12%)
Min        18     0.82 (   0.00%)     1.61 ( -96.34%)
Min        24     0.81 (   0.00%)     1.60 ( -97.53%)
Min        30     0.81 (   0.00%)     1.60 ( -97.53%)
Min        32     0.81 (   0.00%)     1.60 ( -97.53%)
Amean       2     2.02 (   0.00%)     5.10 *-151.83%*
Amean       4     1.03 (   0.00%)     2.61 *-151.98%*
Amean       6     0.96 (   0.00%)     2.07 *-116.74%*
Amean       8     0.82 (   0.00%)     1.60 * -96.56%*
Amean      12     0.87 (   0.00%)     1.67 * -91.73%*
Amean      18     0.82 (   0.00%)     1.63 * -97.94%*
Amean      24     0.81 (   0.00%)     1.60 * -97.41%*
Amean      30     0.82 (   0.00%)     1.60 * -96.93%*
Amean      32     0.82 (   0.00%)     1.60 * -96.56%*
Stddev      2     0.02 (   0.00%)     0.02 (  33.78%)
Stddev      4     0.01 (   0.00%)     0.01 (   7.18%)
Stddev      6     0.01 (   0.00%)     0.00 (  68.77%)
Stddev      8     0.01 (   0.00%)     0.01 (  10.56%)
Stddev     12     0.01 (   0.00%)     0.02 ( -12.69%)
Stddev     18     0.01 (   0.00%)     0.01 (-107.25%)
Stddev     24     0.00 (   0.00%)     0.00 ( -14.56%)
Stddev     30     0.01 (   0.00%)     0.01 (   0.00%)
Stddev     32     0.01 (   0.00%)     0.00 (  20.00%)
CoeffVar    2     1.17 (   0.00%)     0.31 (  73.70%)
CoeffVar    4     0.82 (   0.00%)     0.30 (  63.16%)
CoeffVar    6     1.41 (   0.00%)     0.20 (  85.59%)
CoeffVar    8     0.87 (   0.00%)     0.39 (  54.50%)
CoeffVar   12     1.66 (   0.00%)     0.98 (  41.23%)
CoeffVar   18     0.85 (   0.00%)     0.89 (  -4.71%)
CoeffVar   24     0.52 (   0.00%)     0.30 (  41.97%)
CoeffVar   30     0.65 (   0.00%)     0.33 (  49.22%)
CoeffVar   32     0.65 (   0.00%)     0.26 (  59.30%)
Max         2     2.04 (   0.00%)     5.13 (-151.47%)
Max         4     1.04 (   0.00%)     2.62 (-151.92%)
Max         6     0.98 (   0.00%)     2.08 (-112.24%)
Max         8     0.83 (   0.00%)     1.62 ( -95.18%)
Max        12     0.89 (   0.00%)     1.70 ( -91.01%)
Max        18     0.84 (   0.00%)     1.66 ( -97.62%)
Max        24     0.82 (   0.00%)     1.61 ( -96.34%)
Max        30     0.82 (   0.00%)     1.61 ( -96.34%)
Max        32     0.82 (   0.00%)     1.61 ( -96.34%)
BAmean-50   2     2.01 (   0.00%)     5.09 (-153.39%)
BAmean-50   4     1.03 (   0.00%)     2.60 (-152.62%)
BAmean-50   6     0.95 (   0.00%)     2.07 (-118.82%)
BAmean-50   8     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-50  12     0.86 (   0.00%)     1.66 ( -92.79%)
BAmean-50  18     0.82 (   0.00%)     1.62 ( -97.56%)
BAmean-50  24     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-50  30     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-50  32     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-95   2     2.02 (   0.00%)     5.09 (-151.87%)
BAmean-95   4     1.03 (   0.00%)     2.61 (-151.99%)
BAmean-95   6     0.95 (   0.00%)     2.07 (-117.25%)
BAmean-95   8     0.81 (   0.00%)     1.60 ( -96.72%)
BAmean-95  12     0.87 (   0.00%)     1.67 ( -91.82%)
BAmean-95  18     0.82 (   0.00%)     1.63 ( -97.97%)
BAmean-95  24     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-95  30     0.81 (   0.00%)     1.60 ( -97.00%)
BAmean-95  32     0.81 (   0.00%)     1.60 ( -96.59%)
BAmean-99   2     2.02 (   0.00%)     5.09 (-151.87%)
BAmean-99   4     1.03 (   0.00%)     2.61 (-151.99%)
BAmean-99   6     0.95 (   0.00%)     2.07 (-117.25%)
BAmean-99   8     0.81 (   0.00%)     1.60 ( -96.72%)
BAmean-99  12     0.87 (   0.00%)     1.67 ( -91.82%)
BAmean-99  18     0.82 (   0.00%)     1.63 ( -97.97%)
BAmean-99  24     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-99  30     0.81 (   0.00%)

Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-18 Thread Alexandre Chartre



On 11/18/20 2:22 PM, David Laight wrote:
> From: Alexandre Chartre
>> Sent: 18 November 2020 10:30
> ...
>> Correct, this RFC is not changing the overhead. However, it is a step forward
>> for being able to execute some selected syscalls or interrupt handlers without
>> switching to the kernel page-table. The next step would be to identify and add
>> the necessary mapping to the user page-table so that specified syscalls can be
>> executed without switching the page-table.
>
> Remember that without PTI user space can read all kernel memory.
> (I'm not 100% sure you can force a cache-line read.)
> It isn't even that slow.
> (Even I can understand how it works.)
>
> So if you are worried about user space doing that you can't really
> run anything on the user page tables.

Yes, without PTI, userspace can read all kernel memory. But to run some
part of the kernel you don't need to have all kernel mappings. Also a lot
of the kernel contains non-sensitive information which can be safely exposed
to userspace. So there's probably some room for running carefully selected
syscalls with the user page-table (and hopefully useful ones).

> System calls like getpid() are irrelevant - they aren't used (much).
> Even the time of day ones are implemented in the VDSO without a
> context switch.

getpid()/getppid() is interesting because it shows the amount of overhead
PTI is adding. But the impact can be larger if some TLB flushes are
also required (as you mentioned below).

> So the overheads come from other system calls that 'do work'
> without actually sleeping.
> I'm guessing things like read, write, sendmsg, recvmsg.
>
> The only interesting system call I can think of is futex.
> As well as all the calls that return immediately because the
> mutex has been released while entering the kernel, I suspect
> that being pre-empted by a different thread (of the same process)
> doesn't actually need CR3 reloading (without PTI).
>
> I also suspect that it isn't just the CR3 reload that costs.
> There could (depending on the cpu) be associated TLB and/or cache
> invalidations that have a much larger effect on programs with
> large working sets than on simple benchmark programs.

Right, although the TLB flush cost is mitigated with PCID; the impact is
larger if there's no PCID.

> Now bits of data that you are 'more worried about' could be kept
> in physical memory that isn't normally mapped (or referenced by
> a TLB) and only mapped when needed.
> But that doesn't help the general case.

Note that having syscalls which can run without switching the
page-table is just one benefit you can get from this RFC. But the main
benefit is for integrating Address Space Isolation (ASI) which will be
much more complex if ASI has to plug into the current assembly CR3 switch.

Thanks,

alex.


RE: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-18 Thread David Laight
From: Alexandre Chartre
> Sent: 18 November 2020 10:30
...
> Correct, this RFC is not changing the overhead. However, it is a step forward
> for being able to execute some selected syscalls or interrupt handlers without
> switching to the kernel page-table. The next step would be to identify and add
> the necessary mapping to the user page-table so that specified syscalls can be
> executed without switching the page-table.

Remember that without PTI user space can read all kernel memory.
(I'm not 100% sure you can force a cache-line read.)
It isn't even that slow.
(Even I can understand how it works.)

So if you are worried about user space doing that you can't really
run anything on the user page tables.

System calls like getpid() are irrelevant - they aren't used (much).
Even the time of day ones are implemented in the VDSO without a
context switch.

So the overheads come from other system calls that 'do work'
without actually sleeping.
I'm guessing things like read, write, sendmsg, recvmsg.

The only interesting system call I can think of is futex.
As well as all the calls that return immediately because the
mutex has been released while entering the kernel, I suspect
that being pre-empted by a different thread (of the same process)
doesn't actually need CR3 reloading (without PTI).

I also suspect that it isn't just the CR3 reload that costs.
There could (depending on the cpu) be associated TLB and/or cache
invalidations that have a much larger effect on programs with
large working sets than on simple benchmark programs.

Now bits of data that you are 'more worried about' could be kept
in physical memory that isn't normally mapped (or referenced by
a TLB) and only mapped when needed.
But that doesn't help the general case.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-18 Thread Borislav Petkov
On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote:
> Well, it looks like I wrongly assumed that KPTI was a well known performance
> overhead since it was introduced (because it adds extra page-table switches),
> but you are right I should be presenting my own numbers.

Here's one recipe, courtesy of Mel:

https://github.com/gormanm/mmtests

"
./run-mmtests.sh --no-monitor --config configs/config-workload-poundsyscall test-default

# reboot the machine with pti disabled

./run-mmtests.sh --no-monitor --config configs/config-workload-poundsyscall test-nopti

poundsyscall just calls getppid() so it's a light-weight syscall and a
proxy measure for syscall entry/exit costs. To do the actual compare

cd work/log
../../compare-kernels.sh

and see what gain there is from disabling pti. If you want to compare
the other direction

../../compare-kernels.sh --baseline test-nopti --compare test-default

If you get an error about BinarySearch

(echo y;echo o conf prerequisites_policy follow;echo o conf commit)|cpan
yes | cpan List::BinarySearch

Only use the second line if you want to interactively confirm what cpan
should download and install."

I've CCed him should you have any questions.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-18 Thread Alexandre Chartre



On 11/18/20 10:30 AM, David Laight wrote:
> From: Alexandre Chartre
>> Sent: 18 November 2020 07:42
>>
>> On 11/17/20 10:26 PM, Borislav Petkov wrote:
>>> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>>>> Some benchmarks are available, in particular from phoronix:
>>>
>>> What I was expecting was benchmarks *you* have run which show that
>>> perf penalty, not something one can find quickly on the internet and
>>> something one cannot always reproduce her-/himself.
>>>
>>> You do know that presenting convincing numbers with a patchset greatly
>>> improves its chances of getting it upstreamed, right?
>>
>> Well, it looks like I wrongly assumed that KPTI was a well known performance
>> overhead since it was introduced (because it adds extra page-table switches),
>> but you are right I should be presenting my own numbers.
>
> IIRC the penalty comes from the page table switch.
> Doing it at a different time is unlikely to make much difference.

Correct, this RFC is not changing the overhead. However, it is a step forward
for being able to execute some selected syscalls or interrupt handlers without
switching to the kernel page-table. The next step would be to identify and add
the necessary mapping to the user page-table so that specified syscalls can be
executed without switching the page-table.

> For some workloads the penalty is massive - getting on for 50%.
> We are still using old kernels on AWS.

Here are some micro benchmarks of the getppid and getpid syscalls which
highlight the PTI overhead. This uses the kernel tools/perf command, and the
getpid command from libMICRO (https://github.com/redhat-performance/libMicro):

system running 5.10-rc4 booted with nopti:
------------------------------------------

# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10,000,000 getppid() calls
     Total time: 0.792 [sec]

       0.079223 usecs/op
       12622549 ops/sec

# getpid -B 10
 prc thr   usecs/call  samples   errors cnt/samp
getpid 1   1  0.08029  1020   10


We can see that the getpid and getppid syscalls have the same execution
time, around 0.08 usecs. These syscalls are very small and just return
a value, so the time is mostly spent entering/exiting the kernel.
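
As a rough illustration of this kind of measurement, here is a minimal
user-space sketch that times a loop of getppid() calls. It is not the
perf bench or libMicro code; the helper names now_ns() and
getppid_usecs_per_call() are made up for this example, and the numbers
it produces will depend on the machine and on whether PTI is enabled.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Current monotonic time in nanoseconds (hypothetical helper). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average cost of one getppid() system call, in microseconds.
 * syscall(SYS_getppid) is used so that every iteration really
 * enters and exits the kernel.
 */
static double getppid_usecs_per_call(unsigned long loops)
{
    uint64_t start = now_ns();

    for (unsigned long i = 0; i < loops; i++)
        syscall(SYS_getppid);

    return (double)(now_ns() - start) / 1000.0 / (double)loops;
}
```

Calling getppid_usecs_per_call(10000000) on the same machine booted with
and without pti should give a per-call figure comparable to the usecs/op
line above.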


same system booted with pti:
----------------------------

# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10,000,000 getppid() calls
     Total time: 2.025 [sec]

       0.202527 usecs/op
        4937605 ops/sec

# getpid -B 10
 prc thr   usecs/call  samples   errors cnt/samp
getpid 1   1  0.20241  1020   10


With PTI, the execution time jumps to 0.20 usecs (+0.12 usecs = +150%).

That's a very extreme case because these are very small syscalls, and
in that case the overhead to switch page-tables is significant compared
to the execution time of the syscall.

So with an overhead of +0.12 usecs per syscall, the PTI impact is significant
for workloads which use a lot of short syscalls. But if you use longer
syscalls, for example with an average execution time of 2.0 usecs per syscall,
then you have a lower overhead of 6%.
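
The arithmetic behind these two percentages can be sketched as below,
assuming the PTI cost is a fixed amount added to every syscall. The
function name pti_overhead_percent() is only illustrative.

```c
/*
 * Relative overhead, in percent, of adding a fixed per-syscall cost
 * (extra_usecs) to a syscall that takes base_usecs without PTI.
 * Assumes the extra cost does not depend on what the syscall does.
 */
static double pti_overhead_percent(double base_usecs, double extra_usecs)
{
    return extra_usecs / base_usecs * 100.0;
}
```

With the rounded figures above, pti_overhead_percent(0.08, 0.12) is about
150 and pti_overhead_percent(2.0, 0.12) is about 6.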

alex.


RE: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-18 Thread David Laight
From: Alexandre Chartre
> Sent: 18 November 2020 07:42
> 
> 
> On 11/17/20 10:26 PM, Borislav Petkov wrote:
> > On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> >> Some benchmarks are available, in particular from phoronix:
> >
> > What I was expecting was benchmarks *you* have run which show that
> > perf penalty, not something one can find quickly on the internet and
> > something one cannot always reproduce her-/himself.
> >
> > You do know that presenting convincing numbers with a patchset greatly
> > improves its chances of getting it upstreamed, right?
> >
> 
> Well, it looks like I wrongly assumed that KPTI was a well known performance
> overhead since it was introduced (because it adds extra page-table switches),
> but you are right I should be presenting my own numbers.

IIRC the penalty comes from the page table switch.
Doing it at a different time is unlikely to make much difference.

For some workloads the penalty is massive - getting on for 50%.
We are still using old kernels on AWS.

David



Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Alexandre Chartre



On 11/17/20 10:26 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>> Some benchmarks are available, in particular from phoronix:
>
> What I was expecting was benchmarks *you* have run which show that
> perf penalty, not something one can find quickly on the internet and
> something one cannot always reproduce her-/himself.
>
> You do know that presenting convincing numbers with a patchset greatly
> improves its chances of getting it upstreamed, right?

Well, it looks like I wrongly assumed that KPTI was a well known performance
overhead since it was introduced (because it adds extra page-table switches),
but you are right I should be presenting my own numbers.

Thanks,

alex.


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Alexandre Chartre



On 11/17/20 10:23 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote:
>> No. This prevents the guest VM from gathering data from the host
>> kernel on the same cpu-thread. But there's no mitigation for a guest
>> VM running on a cpu-thread attacking another cpu-thread (which can be
>> running another guest VM or the host kernel) from the same cpu-core.
>> You cannot use flush/clear barriers because the two cpu-threads are
>> running in parallel.
>
> Now there's your justification for why you're doing this. It took a
> while...
>
> The "why" should always be part of the 0th message to provide
> reviewers/maintainers with answers to the question, what this pile of
> patches is all about. Please always add this rationale to your patchset
> in the future.

Sorry about that, I will definitely try to do better next time. :-}

Thanks,

alex.


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Borislav Petkov
On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> Some benchmarks are available, in particular from phoronix:

What I was expecting was benchmarks *you* have run which show that
perf penalty, not something one can find quickly on the internet and
something one cannot always reproduce her-/himself.

You do know that presenting convincing numbers with a patchset greatly
improves its chances of getting it upstreamed, right?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Borislav Petkov
On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote:
> No. This prevents the guest VM from gathering data from the host
> kernel on the same cpu-thread. But there's no mitigation for a guest
> VM running on a cpu-thread attacking another cpu-thread (which can be
> running another guest VM or the host kernel) from the same cpu-core.
> You cannot use flush/clear barriers because the two cpu-threads are
> running in parallel.

Now there's your justification for why you're doing this. It took a
while...

The "why" should always be part of the 0th message to provide
reviewers/maintainers with answers to the question, what this pile of
patches is all about. Please always add this rationale to your patchset
in the future.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Alexandre Chartre




On 11/17/20 7:28 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>> Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
>> the moment. In particular, this allows a guest VM to attack another guest VM
>> or the host kernel running on a sibling cpu-thread. Core Scheduling will
>> mitigate the guest-to-guest attack but not the guest-to-host attack.
>
> I see in vmx_vcpu_enter_exit():
>
>         /* L1D Flush includes CPU buffer clear to mitigate MDS */
>         if (static_branch_unlikely(&vmx_l1d_should_flush))
>                 vmx_l1d_flush(vcpu);
>         else if (static_branch_unlikely(&mds_user_clear))
>                 mds_clear_cpu_buffers();
>
> Is that not enough?

No. This prevents the guest VM from gathering data from the host kernel on the
same cpu-thread. But there's no mitigation for a guest VM running on a
cpu-thread attacking another cpu-thread (which can be running another guest VM
or the host kernel) from the same cpu-core. You cannot use flush/clear barriers
because the two cpu-threads are running in parallel.

alex.


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Borislav Petkov
On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
> the moment. In particular, this allows a guest VM to attack another guest VM
> or the host kernel running on a sibling cpu-thread. Core Scheduling will
> mitigate the guest-to-guest attack but not the guest-to-host attack.

I see in vmx_vcpu_enter_exit():

        /* L1D Flush includes CPU buffer clear to mitigate MDS */
        if (static_branch_unlikely(&vmx_l1d_should_flush))
                vmx_l1d_flush(vcpu);
        else if (static_branch_unlikely(&mds_user_clear))
                mds_clear_cpu_buffers();

Is that not enough?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Alexandre Chartre



On 11/17/20 6:07 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote:
>> We are not reversing PTI, we are extending it.
>
> You're reversing it in the sense that you're mapping more kernel memory
> into the user page table than what is mapped now.
>
>> PTI removes all kernel mapping from the user page-table. However there's
>> no issue with mapping some kernel data into the user page-table as long as
>> these data have no sensitive information.
>
> I hope that is the case.
>
>> Actually, PTI is already doing that but with a very limited scope. PTI adds
>> into the user page-table some kernel mappings which are needed for userland
>> to enter the kernel (such as the kernel entry text, the ESPFIX, the
>> CPU_ENTRY_AREA_BASE...).
>>
>> So here, we are extending the PTI mapping so that we can execute more kernel
>> code while using the user page-table; it's a kind of PTI on steroids.
>
> And this is what bothers me - someone else might come after you and say,
> but but, I need to map more stuff into the user pgt because I wanna do
> X... and so on.

Agree, any addition should be strictly checked. I have been careful to expand
it to the minimum I needed.

>> The minimum size would be 1 page (4KB) as this is the minimum mapping size.
>> It's certainly enough for now as the usage of the PTI stack is limited, but
>> we will need a larger stack if we want to execute more kernel code with the
>> user page-table.
>
> So on a big machine with a million tasks, that's at least a million
> pages more which is what, ~4 Gb?
>
> There better be a very good justification for the additional memory
> consumption...

Yeah, adding a per-task allocation is my main concern, hence this RFC.
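
To put a number on the per-task cost discussed above, here is a small
sketch of the arithmetic. The function name and parameters are only
illustrative; the task count is the "million tasks" figure from the
question, and the page size is the 4KB minimum mapping size.

```c
#include <stdint.h>

/*
 * Extra memory, in bytes, when every task gets extra_pages additional
 * pages of page_size bytes each for its PTI stack.
 */
static uint64_t extra_stack_bytes(uint64_t nr_tasks, uint64_t extra_pages,
                                  uint64_t page_size)
{
    return nr_tasks * extra_pages * page_size;
}
```

extra_stack_bytes(1000000, 1, 4096) is 4,096,000,000 bytes (about 3.8 GiB),
which matches the ~4 GB estimate. With the extra 8KB per task of the current
patch, extra_stack_bytes(1000000, 2, 4096) is 8,192,000,000 bytes.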


alex.


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Alexandre Chartre



On 11/17/20 5:55 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote:
>> The main goal of ASI is to provide KVM address space isolation to
>> mitigate guest-to-host speculative attacks like L1TF or MDS.
>
> Because the current L1TF and MDS mitigations are lacking or why?

Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
the moment. In particular, this allows a guest VM to attack another guest VM
or the host kernel running on a sibling cpu-thread. Core Scheduling will
mitigate the guest-to-guest attack but not the guest-to-host attack. Address
Space Isolation provides a mitigation for guest-to-host attack.

>> Current proposal of ASI is plugged into the CR3 switch assembly macro
>> which makes the code brittle and complex. (see [1])
>>
>> I am also expecting this might help with some other ideas like having
>> syscall (or interrupt handler) which can run without switching the
>> page-table.
>
> I still fail to see why we need all that. I read, "this does this and
> that" but I don't read "the current problem is this" and "this is our
> suggested solution for it".
>
> So what is the issue which needs addressing in the current kernel which
> is going to justify adding all that code?

The main issue this is trying to address is that the CR3 switch is currently
done in assembly code from contexts which are very restrictive: the CR3 switch
is often done when only one or two registers are available for use, sometimes
no stack is available. For example, the syscall entry switches CR3 with a single
register available (%sp) and no stack.

Because of this, it is fairly tricky to expand the logic for switching CR3.
This is a problem that we have faced while implementing Address Space Isolation
(ASI) where we need extra logic to drive the page-table switch. We have
successfully implemented ASI with the current CR3 switching assembly code, but
this requires complex assembly constructs. Hence this proposal to defer CR3
switching to C code so that it can be expanded more easily.

Hopefully this can also help make the assembly entry code less complex,
and be beneficial to other projects.

>> PTI has a measured overhead of roughly 5% for most workloads, but it can
>> be much higher in some cases.
>
> "it can be"? Where? Actual use case?

Some benchmarks are available, in particular from phoronix:

https://www.phoronix.com/scan.php?page=article&item=linux-more-x86pti
https://www.phoronix.com/scan.php?page=news_item&px=x86-PTI-Initial-Gaming-Tests
https://www.phoronix.com/scan.php?page=article&item=linux-kpti-kvm
https://medium.com/@loganaden/linux-kpti-performance-hit-on-real-workloads-8da185482df3

>> The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged
>> directly into the CR3 switch assembly macro. We are working on a new
>> implementation, based on these changes which avoid having to deal with
>> assembly code and makes the implementation more robust.
>
> This still doesn't answer my questions. I read a lot of "could be used
> for" formulations but I still don't know why we need that. So what is
> the problem that the kernel currently has which you're trying to address
> with this?

Hopefully this is clearer with the answer I provided above.

Thanks,

alex.


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Borislav Petkov
On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote:
> We are not reversing PTI, we are extending it.

You're reversing it in the sense that you're mapping more kernel memory
into the user page table than what is mapped now.

> PTI removes all kernel mapping from the user page-table. However there's
> no issue with mapping some kernel data into the user page-table as long as
> these data have no sensitive information.

I hope that is the case.

> Actually, PTI is already doing that but with a very limited scope. PTI adds
> into the user page-table some kernel mappings which are needed for userland
> to enter the kernel (such as the kernel entry text, the ESPFIX, the
> CPU_ENTRY_AREA_BASE...).
> 
> So here, we are extending the PTI mapping so that we can execute more kernel
> code while using the user page-table; it's a kind of PTI on steroids.

And this is what bothers me - someone else might come after you and say,
but but, I need to map more stuff into the user pgt because I wanna do
X... and so on.

> The minimum size would be 1 page (4KB) as this is the minimum mapping size.
> It's certainly enough for now as the usage of the PTI stack is limited, but
> we will need a larger stack if we want to execute more kernel code with the
> user page-table.

So on a big machine with a million tasks, that's at least a million
pages more which is what, ~4 Gb?

There better be a very good justification for the additional memory
consumption...

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Borislav Petkov
On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote:
> The main goal of ASI is to provide KVM address space isolation to
> mitigate guest-to-host speculative attacks like L1TF or MDS.

Because the current L1TF and MDS mitigations are lacking or why?

> Current proposal of ASI is plugged into the CR3 switch assembly macro
> which makes the code brittle and complex. (see [1])
>
> I am also expecting this might help with some other ideas like having
> syscall (or interrupt handler) which can run without switching the
> page-table.

I still fail to see why we need all that. I read, "this does this and
that" but I don't read "the current problem is this" and "this is our
suggested solution for it".

So what is the issue which needs addressing in the current kernel which
is going to justify adding all that code?

> PTI has a measured overhead of roughly 5% for most workloads, but it can
> be much higher in some cases.

"it can be"? Where? Actual use case?

> The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged
> directly into the CR3 switch assembly macro. We are working on a new
> implementation, based on these changes which avoid having to deal with
> assembly code and makes the implementation more robust.

This still doesn't answer my questions. I read a lot of "could be used
for" formulations but I still don't know why we need that. So what is
the problem that the kernel currently has which you're trying to address
with this?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-17 Thread Alexandre Chartre



On 11/16/20 9:24 PM, Borislav Petkov wrote:
> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> Deferring CR3 switch to C code means that we need to run more of the
>> kernel entry code with the user page-table. To do so, we need to:
>>
>>   - map more syscall, interrupt and exception entry code into the user
>>     page-table (map all noinstr code);
>>
>>   - map additional data used in the entry code (such as stack canary);
>>
>>   - run more entry code on the trampoline stack (which is mapped both
>>     in the kernel and in the user page-table) until we switch to the
>>     kernel page-table and then switch to the kernel stack;
>
> So PTI was added exactly to *not* have kernel memory mapped in the user
> page table. You're partially reversing that...

We are not reversing PTI, we are extending it.

PTI removes all kernel mapping from the user page-table. However there's
no issue with mapping some kernel data into the user page-table as long as
these data have no sensitive information.

Actually, PTI is already doing that but with a very limited scope. PTI adds
into the user page-table some kernel mappings which are needed for userland
to enter the kernel (such as the kernel entry text, the ESPFIX, the
CPU_ENTRY_AREA_BASE...).

So here, we are extending the PTI mapping so that we can execute more kernel
code while using the user page-table; it's a kind of PTI on steroids.

>>   - have a per-task trampoline stack instead of a per-cpu trampoline
>>     stack, so the task can be scheduled out while it hasn't switched
>>     to the kernel stack.
>
> per-task? How much more memory is that per task?

Currently, this is done by doubling the size of the task stack (patch 8),
so that's an extra 8KB. Half of the stack is used as the regular kernel
stack, and the other half used as the PTI stack:

+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ *   +-------------+
+ *   |             | ^                       ^
+ *   | kernel-only | | KERNEL_STACK_SIZE     |
+ *   |    stack    | |                       |
+ *   |             | V                       |
+ *   +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *   |             | ^                       |
+ *   | kernel and  | | KERNEL_STACK_SIZE     |
+ *   | PTI stack   | |                       |
+ *   |             | V                       v
+ *   +-------------+ <- top of stack
+ */

The minimum size would be 1 page (4KB) as this is the minimum mapping size.
It's certainly enough for now as the usage of the PTI stack is limited, but
we will need a larger stack if we want to execute more kernel code with the
user page-table.
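
As a rough illustration of the layout described in the comment above, here
is a minimal userspace sketch, not the code from patch 8. All sketch_* names
are hypothetical, and the 8KB half size is simply the figure quoted in this
mail, taken as an assumption. It computes the two stack tops and checks
whether an address falls in the half that is also mapped in the user
page-table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes, for illustration only (8KB per half, as quoted above). */
#define SKETCH_PAGE_SIZE         4096UL
#define SKETCH_KERNEL_STACK_SIZE (2 * SKETCH_PAGE_SIZE)
#define SKETCH_THREAD_SIZE       (2 * SKETCH_KERNEL_STACK_SIZE) /* doubled */

/* Hypothetical description of one task stack allocation. */
struct sketch_stack_layout {
	uintptr_t base;       /* lowest address of the whole allocation */
	uintptr_t kernel_top; /* initial SP of the kernel-only half */
	uintptr_t pti_top;    /* initial SP of the half also mapped for user */
};

/* Stacks grow down, so each "top" is the end address of its half. */
static inline struct sketch_stack_layout sketch_layout(uintptr_t base)
{
	struct sketch_stack_layout l;

	assert(base % SKETCH_PAGE_SIZE == 0); /* mappings are page granular */
	l.base = base;
	l.kernel_top = base + SKETCH_KERNEL_STACK_SIZE;
	l.pti_top = base + SKETCH_THREAD_SIZE;
	return l;
}

/* True if addr lies in the half that is mapped in both page-tables. */
static inline bool sketch_on_pti_stack(const struct sketch_stack_layout *l,
				       uintptr_t addr)
{
	return addr >= l->kernel_top && addr < l->pti_top;
}
```

With this layout, anything pushed while still on the user page-table lands
in the upper half, and the kernel-only half is never visible from the user
page-table.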

alex.


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-16 Thread Alexandre Chartre



On 11/16/20 9:17 PM, Borislav Petkov wrote:

> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> This RFC proposes to defer the PTI CR3 switch until we reach C code.
>> The benefit is that this simplifies the assembly entry code, and makes
>> the PTI CR3 switch code easier to understand. This also paves the way
>> for further possible projects such as an easier integration of Address
>> Space Isolation (ASI), or the possibility to execute some selected
>> syscall or interrupt handlers without switching to the kernel page-table
>
> What for? What is this going to be used for in the end?



In addition to simplifying the assembly entry code, this will also simplify
the integration of Address Space Isolation (ASI), which will certainly be
the primary beneficiary of this change. The main goal of ASI is to provide
KVM address space isolation to mitigate guest-to-host speculative attacks
like L1TF or MDS. The current ASI proposal is plugged into the CR3 switch
assembly macro, which makes the code brittle and complex (see [1]).

I also expect this might help with some other ideas, like having a
syscall (or interrupt handler) which can run without switching the
page-table.



>> (and thus avoid the PTI page-table switch overhead).
>
> Overhead of how much? Why do we care?



PTI has a measured overhead of roughly 5% for most workloads, but it can
be much higher in some cases. The overhead is mostly due to the page-table
switch (even with PCID) so if we can run a syscall or an interrupt handler
without switching the page-table then we can get this kind of performance
back.



> What is the big picture justification for this diffstat
>
>>  21 files changed, 874 insertions(+), 314 deletions(-)
>
> and the diffstat for the ASI enablement?



The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged directly into
the CR3 switch assembly macro. We are working on a new implementation based
on these changes, which avoids having to deal with assembly code and makes the
implementation more robust.

alex.

[1] ASI RFCv4 - 
https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.char...@oracle.com/


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-16 Thread Borislav Petkov
On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
> Deferring CR3 switch to C code means that we need to run more of the
> kernel entry code with the user page-table. To do so, we need to:
> 
>  - map more syscall, interrupt and exception entry code into the user
>page-table (map all noinstr code);
> 
>  - map additional data used in the entry code (such as stack canary);
> 
>  - run more entry code on the trampoline stack (which is mapped both
>in the kernel and in the user page-table) until we switch to the
>kernel page-table and then switch to the kernel stack;

So PTI was added exactly to *not* have kernel memory mapped in the user
page table. You're partially reversing that...

>  - have a per-task trampoline stack instead of a per-cpu trampoline
>stack, so the task can be scheduled out while it hasn't switched
>to the kernel stack.

per-task? How much more memory is that per task?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-16 Thread Borislav Petkov
On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
> This RFC proposes to defer the PTI CR3 switch until we reach C code.
> The benefit is that this simplifies the assembly entry code, and makes
> the PTI CR3 switch code easier to understand. This also paves the way
> for further possible projects such as an easier integration of Address
> Space Isolation (ASI), or the possibility to execute some selected
> syscall or interrupt handlers without switching to the kernel page-table

What for? What is this going to be used for in the end?

> (and thus avoid the PTI page-table switch overhead).

Overhead of how much? Why do we care?

What is the big picture justification for this diffstat

>  21 files changed, 874 insertions(+), 314 deletions(-)

and the diffstat for the ASI enablement?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


[RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

2020-11-16 Thread Alexandre Chartre
Version 2 addressing comments from Andy:

- paranoid_entry/exit is back to assembly code. This avoids having
  a C version of SWAPGS and the need to disable stack-protector.
  (removes patches 8, 9, 21 from v1).

- SAVE_AND_SWITCH_TO_KERNEL_CR3 and RESTORE_CR3 are removed from
  paranoid_entry/exit and moved to C (patch 19).

- __per_cpu_offset is mapped into the user page-table (patch 11)
  so that paranoid_entry can update GS before CR3 is switched.

- use a different stack canary with the user and kernel page-tables.
  This is a new patch in v2 to not leak the kernel stack canary
  in the user page-table (patch 21).

Patches are now based on v5.10-rc4.



With Page Table Isolation (PTI), syscalls as well as interrupts and
exceptions occurring in userspace enter the kernel with a user
page-table. The kernel entry code will then switch the page-table
from the user page-table to the kernel page-table by updating the
CR3 control register. This CR3 switch is currently done early in
the kernel entry sequence using assembly code.
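
To make the CR3 switch concrete, here is a small userspace sketch of the
bit arithmetic only. It is not code from this series and it never touches
the real CR3 register. It assumes the usual x86-64 PTI convention where the
user page-table root sits one page above the kernel one (bit 12) and the
user PCID is the kernel PCID with bit 11 set. The sketch_* names are
hypothetical, and details such as the PCID no-flush bit are deliberately
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, following the x86-64 PTI convention. */
#define SKETCH_PTI_USER_PGTABLE_BIT 12 /* user PGD = kernel PGD + one page */
#define SKETCH_PTI_USER_PCID_BIT    11 /* user PCID = kernel PCID | bit 11 */
#define SKETCH_PTI_USER_MASK \
	((1ULL << SKETCH_PTI_USER_PGTABLE_BIT) | \
	 (1ULL << SKETCH_PTI_USER_PCID_BIT))

/* CR3 value to load when entering the kernel: clear both user bits. */
static inline uint64_t sketch_kernel_cr3(uint64_t cr3)
{
	return cr3 & ~SKETCH_PTI_USER_MASK;
}

/* CR3 value to load when returning to userspace: set both user bits. */
static inline uint64_t sketch_user_cr3(uint64_t cr3)
{
	return cr3 | SKETCH_PTI_USER_MASK;
}

/* True if the value selects the user page-table. */
static inline bool sketch_is_user_cr3(uint64_t cr3)
{
	return (cr3 & (1ULL << SKETCH_PTI_USER_PGTABLE_BIT)) != 0;
}
```

For example, under these assumptions a kernel value of 0x12344001 maps to a
user value of 0x12345801, and back again.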

This RFC proposes to defer the PTI CR3 switch until we reach C code.
The benefit is that this simplifies the assembly entry code, and makes
the PTI CR3 switch code easier to understand. This also paves the way
for further possible projects such as an easier integration of Address
Space Isolation (ASI), or the possibility to execute some selected
syscall or interrupt handlers without switching to the kernel page-table
(and thus avoid the PTI page-table switch overhead).

Deferring CR3 switch to C code means that we need to run more of the
kernel entry code with the user page-table. To do so, we need to:

 - map more syscall, interrupt and exception entry code into the user
   page-table (map all noinstr code);

 - map additional data used in the entry code (such as stack canary);

 - run more entry code on the trampoline stack (which is mapped both
   in the kernel and in the user page-table) until we switch to the
   kernel page-table and then switch to the kernel stack;

 - have a per-task trampoline stack instead of a per-cpu trampoline
   stack, so the task can be scheduled out while it hasn't switched
   to the kernel stack.

Note that, for now, the CR3 switch can only be pushed as far as interrupts
remain disabled in the entry code. This is because the CR3 switch is done
based on the privilege level from the CS register from the interrupt frame.
I plan to fix this but that's some extra complication (need to track if the
user page-table is used or not).
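
The limitation above can be sketched as follows. This is a simplified
userspace model, not the code from the series: struct sketch_frame and the
sketch_* helpers are hypothetical stand-ins. The only fact relied upon is
that the low two bits of a saved CS selector hold the privilege level of
the interrupted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the saved interrupt frame. */
struct sketch_frame {
	uint64_t ip;
	uint64_t cs;
	uint64_t flags;
	uint64_t sp;
	uint64_t ss;
};

/* The low two bits of a segment selector hold the privilege level. */
#define SKETCH_RPL_MASK 3ULL
#define SKETCH_USER_RPL 3ULL

/* True if the interrupted code was running in user mode (ring 3). */
static inline bool sketch_from_user(const struct sketch_frame *f)
{
	return (f->cs & SKETCH_RPL_MASK) == SKETCH_USER_RPL;
}

/*
 * Decide whether the entry path has to switch to the kernel page-table.
 * This only holds while interrupts stay disabled before the switch:
 * an interrupt nested inside entry code would show a kernel CS even
 * though the user page-table is still active.
 */
static inline bool sketch_need_cr3_switch(const struct sketch_frame *f)
{
	return sketch_from_user(f);
}
```

A frame saved with the 64-bit Linux user code selector (0x33) yields true,
and one with the kernel code selector (0x10) yields false. That second
answer is the one that becomes unreliable once interrupts are enabled
before the switch, hence the need for explicit tracking.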

The proposed patchset is in RFC state to get early feedback about this
proposal.

The code survives running a kernel build and LTP. Note that changes are
only for 64-bit at the moment; I haven't looked at 32-bit yet, but I will
definitely check it.

Patches are based on v5.10-rc4.

Thanks,

alex.

-

Alexandre Chartre (21):
  x86/syscall: Add wrapper for invoking syscall function
  x86/entry: Update asm_call_on_stack to support more function arguments
  x86/entry: Consolidate IST entry from userspace
  x86/sev-es: Define a setup stack function for the VC idtentry
  x86/entry: Implement ret_from_fork body with C code
  x86/pti: Provide C variants of PTI switch CR3 macros
  x86/entry: Fill ESPFIX stack using C code
  x86/pti: Introduce per-task PTI trampoline stack
  x86/pti: Function to clone page-table entries from a specified mm
  x86/pti: Function to map per-cpu page-table entry
  x86/pti: Extend PTI user mappings
  x86/pti: Use PTI stack instead of trampoline stack
  x86/pti: Execute syscall functions on the kernel stack
  x86/pti: Execute IDT handlers on the kernel stack
  x86/pti: Execute IDT handlers with error code on the kernel stack
  x86/pti: Execute system vector handlers on the kernel stack
  x86/pti: Execute page fault handler on the kernel stack
  x86/pti: Execute NMI handler on the kernel stack
  x86/pti: Defer CR3 switch to C code for IST entries
  x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
  x86/pti: Use a different stack canary with the user and kernel
    page-table

 arch/x86/entry/common.c               |  58 -
 arch/x86/entry/entry_64.S             | 346 +++---
 arch/x86/entry/entry_64_compat.S      |  22 --
 arch/x86/include/asm/entry-common.h   | 194 +++
 arch/x86/include/asm/idtentry.h       | 130 +-
 arch/x86/include/asm/irq_stack.h      |  11 +
 arch/x86/include/asm/page_64_types.h  |  36 ++-
 arch/x86/include/asm/processor.h      |   3 +
 arch/x86/include/asm/pti.h            |  18 ++
 arch/x86/include/asm/stackprotector.h |  35 ++-
 arch/x86/include/asm/switch_to.h      |   7 +-
 arch/x86/include/asm/traps.h          |   2 +-
 arch/x86/kernel/cpu/mce/core.c        |   7 +-
 arch/x86/kernel/espfix_64.c           |  41 +++
 arch/x86/kernel/nmi.c                 |  34 ++-
 arch/x86/kernel/sev-es.c              |  63 +
 arch/x86/kernel/traps.c               |  61 +++--
 arch/x86/mm/fault.c                   |  11 +-