Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-25 Thread David H. Gutteridge
On Mon, 2018-07-23 at 09:29 +0200, Joerg Roedel wrote:
> Hey David,
> 
> On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> > Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> > driver. (This is the same VM definition I shared with you in a PM
> > back on Feb. 20th, except note that 4.18 kernels won't successfully
> > boot with QEMU's IDE device, so I'm using SATA instead. That's a
> > regression totally unrelated to your change sets, or to the general
> > booting issue with 4.18 RC5, since it occurs in vanilla RC4 as
> > well.)
> 
> Yes, this needs the fixes in the tip/x86/mm branch as well. Can you
> pull that branch in and test again, please?

Sorry, I didn't realize I needed those changes, too. I've re-tested
with those applied and haven't encountered any issues. I'm now
re-testing again with your newer patch set from the 25th. No issues
so far with those, either; I'll confirm in that email thread after
the laptop has seen some more use.

Dave




Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-24 Thread Pavel Machek
On Mon 2018-07-23 14:50:50, Andy Lutomirski wrote:
> 
> 
> > On Jul 23, 2018, at 2:38 PM, Pavel Machek  wrote:
> > 
> >> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> >>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek  wrote:
> >>> 
> >>> Meanwhile... it looks like gcc is not slowed down significantly, but
> >>> other stuff sees 30% .. 40% slowdowns... which is rather
> >>> significant.
> >> 
> >> That is more or less expected.
> >> 
> >> Gcc spends about 90+% of its time in user space, and the system calls
> >> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> >> no longer have the pipe between cpp and cc1, so they don't have that
> >> issue either (which would have shown the PTI slowdown a lot more)
> >> 
> >> Some other loads will do a lot more time traversing the user/kernel
> >> boundary, and in 32-bit mode you won't be able to take advantage of
> >> the address space ID's, so you really get the full effect.
> > 
> > Understood. Just -- bzip2 should include quite a lot of time in
> > userspace, too. 
> > 
> >>> Would it be possible to have per-process control of kpti? I have
> >>> some processes where trading off speed against security would make sense.
> >> 
> >> That was pretty extensively discussed, and no sane model for it was
> >> ever agreed upon.  Some people wanted it per-thread, others per-mm,
> >> and it wasn't clear how to set it either and how it should inherit
> >> across fork/exec, and what the namespace rules etc should be.
> >> 
> >> You absolutely need to inherit it (so that you can say "I trust this
> >> session" or whatever), but at the same time you *don't* want to
> >> inherit if you have a server you trust that then spawns user processes
> >> (think "I want systemd to not have the overhead, but the user
> >> processes it spawns obviously do need protection").
> >> 
> >> It was just a morass. Nothing came out of it.  I guess people can
> >> discuss it again, but it's not simple.
> > 
> > I agree it is not easy. OTOH -- 30% of user-visible performance is a
> > _lot_. That is worth spending man-years on...  Ok, problem is not as
> > severe on modern CPUs with address space ID's, but...
> > 
> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now.. I see someone may want to have it
> > per-thread, because for stuff like javascript JIT, thread may have
> > rights to call ptrace, but is unable to call ptrace because JIT
> > removed that ability... hmm...
> 
> No, you don’t want that. The problem is that Meltdown isn’t a problem that 
> exists in isolation. It’s very plausible that JavaScript code could trigger a 
> speculation attack that, with PTI off, could read kernel memory.

Ok, you are right. It is trickier than I thought.

Still, I probably don't need to run greps and cats with PTI
on. Chromium (etc.) probably needs it. A Python interpreter running my
own code probably does not.

Yes, my Thinkpad X60 is probably thermal-throttled. It is not a really
new machine. I switched to a T40p for now :-).

What is the "worst" case people are seeing?
time dd if=/dev/zero of=/dev/null bs=1 count=1000
can reproduce a 3x slowdown, but that's basically a microbenchmark.
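The dd one-liner can be turned into a rough before/after probe. A sketch (the byte counts below are my own choice, not taken from the thread): the bs=1 run is syscall-bound, so the kernel entry/exit cost that PTI adds dominates, while a large block size amortizes it away.

```shell
# Syscall-bound: one read()+write() pair per byte copied, so per-syscall
# overhead (e.g. PTI's CR3 switches) dominates the runtime.
time dd if=/dev/zero of=/dev/null bs=1 count=100000

# Copy-bound: the same per-syscall overhead is paid only once per
# megabyte, so PTI's effect should mostly disappear.
time dd if=/dev/zero of=/dev/null bs=1M count=100
```

Comparing the two ratios with PTI on and off separates syscall overhead from raw copy throughput.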

A root-only per-process enable/disable of KPTI should not be too hard
to do. Would there be interest in that?
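Whether a given CPU offers the address-space IDs mentioned above can be checked from userspace. A quick sketch (assumes an x86 Linux box with /proc mounted):

```shell
# PCID support shows up as the "pcid" flag in /proc/cpuinfo; without it
# (and on 32-bit kernels, which cannot use it) every PTI CR3 switch
# implies a full TLB flush, which is where the worst slowdowns come from.
grep -q -w pcid /proc/cpuinfo && echo "pcid available" || echo "no pcid"
```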

Best regards,

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html





Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-24 Thread Andy Lutomirski



> On Jul 24, 2018, at 6:39 AM, Pavel Machek  wrote:
> 
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek  wrote:
>>> 
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>> 
>> That is more or less expected.
> 
> Ok, so I was wrong. bzip2 showed 30% slowdown, but running test in a
> loop, I get (on v4.18) that, too.
> 
> 

...

The obvious cause would be thermal issues, which are increasingly common in 
laptops.  You could get cycle counts from perf stat, perhaps.
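One way to do that (a sketch: the `cycles` and `task-clock` events and the `-r` repeat flag are standard perf options, but the workload here is illustrative):

```shell
# Repeat the workload five times and report cycles and task-clock; a
# thermally throttled CPU shows up as a dropping GHz (cycles/second)
# figure across runs rather than a genuine PTI cost.
perf stat -e cycles,task-clock -r 5 -- \
        sh -c 'head -c 1000000 /dev/urandom | bzip2 -9 | wc -c'
```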


Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-24 Thread Pavel Machek
On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek  wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
> 
> That is more or less expected.

Ok, so I was wrong. bzip2 showed a 30% slowdown, but running the test
in a loop, I get that on v4.18, too.

That tells me that something is wrong with the machine I'm using for
benchmarking. Whether KPTI is enabled can still be measured with the
bzip2 pipeline, but the effect is far more subtle.

Pavel

pavel@amd:~$ while true; do time cat /dev/urandom | head -c 1000 |
bzip2 -9 - | wc -c ; done
10044031
3.87user 0.91system 4.62 (0m4.622s) elapsed 103.48%CPU
10044234
4.03user 0.82system 4.68 (0m4.688s) elapsed 103.67%CPU
10043664
4.28user 0.85system 4.99 (0m4.994s) elapsed 102.90%CPU
10045959
4.43user 0.85system 5.12 (0m5.121s) elapsed 103.44%CPU
10043829
4.50user 0.89system 5.22 (0m5.228s) elapsed 103.22%CPU
10044296
4.65user 0.93system 5.39 (0m5.398s) elapsed 103.61%CPU
10045311
4.76user 0.93system 5.47 (0m5.479s) elapsed 103.98%CPU
10043819
4.81user 0.93system 5.55 (0m5.556s) elapsed 103.37%CPU
10045097
4.72user 1.04system 5.59 (0m5.597s) elapsed 103.01%CPU
10044012
4.86user 0.97system 5.68 (0m5.684s) elapsed 102.79%CPU
10044569
4.93user 0.96system 5.72 (0m5.728s) elapsed 102.92%CPU
10044141
4.94user 0.98system 5.75 (0m5.752s) elapsed 102.97%CPU
10043695
4.97user 0.95system 5.76 (0m5.768s) elapsed 102.87%CPU
10045690
5.12user 0.94system 5.90 (0m5.901s) elapsed 102.79%CPU
10045153
5.06user 1.00system 5.88 (0m5.883s) elapsed 103.21%CPU
10044560
5.10user 1.01system 5.92 (0m5.927s) elapsed 103.31%CPU
10044845
5.17user 0.99system 5.96 (0m5.960s) elapsed 103.44%CPU
10043884
5.15user 1.03system 6.00 (0m6.004s) elapsed 103.14%CPU
10044286
5.18user 1.01system 6.00 (0m6.002s) elapsed 103.40%CPU
10045749
5.00user 1.22system 6.04 (0m6.044s) elapsed 102.98%CPU
10044098
5.22user 1.02system 6.05 (0m6.053s) elapsed 103.21%CPU
10045326
5.20user 1.01system 6.04 (0m6.048s) elapsed 102.72%CPU
10042365
5.22user 1.03system 6.06 (0m6.061s) elapsed 103.30%CPU
10043952
5.24user 1.00system 6.06 (0m6.069s) elapsed 102.97%CPU
10044569
5.30user 1.00system 6.09 (0m6.099s) elapsed 103.46%CPU
10043241
5.26user 1.00system 6.09 (0m6.097s) elapsed 102.79%CPU
10044797
5.30user 1.01system 6.11 (0m6.114s) elapsed 103.46%CPU
10043711
5.25user 1.02system 6.09 (0m6.093s) elapsed 103.03%CPU
10043882
5.31user 1.01system 6.13 (0m6.131s) elapsed 103.28%CPU
10043571
5.26user 1.05system 6.13 (0m6.133s) elapsed 103.06%CPU
10044742
5.29user 1.03system 6.12 (0m6.122s) elapsed 103.25%CPU
10044170
5.35user 1.04system 6.18 (0m6.183s) elapsed 103.60%CPU
10043542
5.22user 1.12system 6.17 (0m6.172s) elapsed 102.89%CPU
10042985
5.25user 1.13system 6.19 (0m6.193s) elapsed 103.09%CPU
10044102
5.36user 1.01system 6.17 (0m6.177s) elapsed 103.17%CPU
10044609
5.48user 0.99system 6.28 (0m6.284s) elapsed 103.11%CPU
10045185
5.40user 1.03system 6.23 (0m6.236s) elapsed 103.29%CPU
1004
5.41user 1.06system 6.25 (0m6.255s) elapsed 103.55%CPU
10044859
5.35user 1.04system 6.20 (0m6.201s) elapsed 103.17%CPU
10045613

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html





Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Dave Hansen
On 07/23/2018 02:59 PM, Josh Poimboeuf wrote:
> On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
>> But for now I'd like at least a "global" option for turning PTI on/off
>> at runtime, for benchmarking. Let me see...
>>
>> Something like this, or is it going to be way more complex? Does
>> anyone have a patch by chance?
> RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().

Let's not forget PTI's NX-for-userspace in the kernel page tables.  That
provides Spectre V2 mitigation as well as Meltdown mitigation.


Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Josh Poimboeuf
On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
> But for now I'd like at least a "global" option for turning PTI on/off
> at runtime, for benchmarking. Let me see...
> 
> Something like this, or is it going to be way more complex? Does
> anyone have a patch by chance?

RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().
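Pending such a knob in mainline, PTI status can at least be inspected, and PTI toggled per boot. A sketch (the sysfs path and command-line flags are as in mainline 4.15+):

```shell
# Report whether PTI is active; prints "Mitigation: PTI" when enabled.
cat /sys/devices/system/cpu/vulnerabilities/meltdown

# For benchmarking without PTI, append "pti=off" (or the older "nopti")
# to the kernel command line and reboot; there is no runtime switch.
```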

-- 
Josh



Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Pavel Machek
Hi!

> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now.. I see someone may want to have it
> > per-thread, because for stuff like javascript JIT, thread may have
> > rights to call ptrace, but is unable to call ptrace because JIT
> > removed that ability... hmm...
> 
> No, you don’t want that. The problem is that Meltdown isn’t a problem that 
> exists in isolation. It’s very plausible that JavaScript code could trigger a 
> speculation attack that, with PTI off, could read kernel memory.

Yeah, the web browser threads that run JavaScript code should have PTI
on. But maybe I want the rest of the web browser with PTI off.

So... yes, I see why someone may want it per-thread (and not
per-process).

I guess per-process would be good enough for me. Actually, maybe even
per-uid. I don't have any fancy security here, so anything running as
uid 0 or 1000 is close enough to trusted.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html





Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Andy Lutomirski



> On Jul 23, 2018, at 2:38 PM, Pavel Machek  wrote:
> 
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek  wrote:
>>> 
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>> 
>> That is more or less expected.
>> 
>> Gcc spends about 90+% of its time in user space, and the system calls
>> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
>> no longer have the pipe between cpp and cc1, so they don't have that
>> issue either (which would have shown the PTI slowdown a lot more)
>> 
>> Some other loads will do a lot more time traversing the user/kernel
>> boundary, and in 32-bit mode you won't be able to take advantage of
>> the address space ID's, so you really get the full effect.
> 
> Understood. Just -- bzip2 should include quite a lot of time in
> userspace, too. 
> 
>>> Would it be possible to have per-process control of kpti? I have
>>> some processes where trading off speed against security would make sense.
>> 
>> That was pretty extensively discussed, and no sane model for it was
>> ever agreed upon.  Some people wanted it per-thread, others per-mm,
>> and it wasn't clear how to set it either and how it should inherit
>> across fork/exec, and what the namespace rules etc should be.
>> 
>> You absolutely need to inherit it (so that you can say "I trust this
>> session" or whatever), but at the same time you *don't* want to
>> inherit if you have a server you trust that then spawns user processes
>> (think "I want systemd to not have the overhead, but the user
>> processes it spawns obviously do need protection").
>> 
>> It was just a morass. Nothing came out of it.  I guess people can
>> discuss it again, but it's not simple.
> 
> I agree it is not easy. OTOH -- 30% of user-visible performance is a
> _lot_. That is worth spending man-years on...  Ok, problem is not as
> severe on modern CPUs with address space ID's, but...
> 
> What I want is "if A can ptrace B, and B has pti disabled, A can have
> pti disabled as well". Now.. I see someone may want to have it
> per-thread, because for stuff like javascript JIT, thread may have
> rights to call ptrace, but is unable to call ptrace because JIT
> removed that ability... hmm...

No, you don’t want that. The problem is that Meltdown isn’t a problem that 
exists in isolation. It’s very plausible that JavaScript code could trigger a 
speculation attack that, with PTI off, could read kernel memory.



Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Pavel Machek
On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek  wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
> 
> That is more or less expected.
> 
> Gcc spends about 90+% of its time in user space, and the system calls
> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> no longer have the pipe between cpp and cc1, so they don't have that
> issue either (which would have shown the PTI slowdown a lot more)
> 
> Some other loads will do a lot more time traversing the user/kernel
> boundary, and in 32-bit mode you won't be able to take advantage of
> the address space ID's, so you really get the full effect.

Understood. Just -- bzip2 should include quite a lot of time in
userspace, too. 

> > Would it be possible to have per-process control of kpti? I have
> > some processes where trading off speed against security would make sense.
> 
> That was pretty extensively discussed, and no sane model for it was
> ever agreed upon.  Some people wanted it per-thread, others per-mm,
> and it wasn't clear how to set it either and how it should inherit
> across fork/exec, and what the namespace rules etc should be.
> 
> You absolutely need to inherit it (so that you can say "I trust this
> session" or whatever), but at the same time you *don't* want to
> inherit if you have a server you trust that then spawns user processes
> (think "I want systemd to not have the overhead, but the user
> processes it spawns obviously do need protection").
> 
> It was just a morass. Nothing came out of it.  I guess people can
> discuss it again, but it's not simple.

I agree it is not easy. OTOH -- 30% of user-visible performance is a
_lot_. That is worth spending man-years on...  Ok, problem is not as
severe on modern CPUs with address space ID's, but...

What I want is "if A can ptrace B, and B has pti disabled, A can have
pti disabled as well". Now.. I see someone may want to have it
per-thread, because for stuff like javascript JIT, thread may have
rights to call ptrace, but is unable to call ptrace because JIT
removed that ability... hmm...

But for now I'd like at least a "global" option for turning PTI on/off
at runtime, for benchmarking. Let me see...

Something like this, or is it going to be way more complex? Does
anyone have a patch by chance?

Pavel

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index dfb975b..719e39a 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -162,6 +162,9 @@
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
	ALTERNATIVE	"jmp .Lend_\@", "", X86_FEATURE_PTI

+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
	movl	%cr3, \scratch_reg
	orl	$PTI_SWITCH_MASK, \scratch_reg
	movl	\scratch_reg, %cr3
@@ -176,6 +179,8 @@
	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
	jz	.Lend_\@
	.endif
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
	/* On user-cr3? */
	movl	%cr3, %eax
	testl	$PTI_SWITCH_MASK, %eax
@@ -192,6 +197,10 @@
  */
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
	ALTERNATIVE	"jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
	movl	%cr3, \scratch_reg
	/* Test if we are already on kernel CR3 */
	testl	$PTI_SWITCH_MASK, \scratch_reg
@@ -302,6 +311,9 @@
  */
	ALTERNATIVE	"jmp .Lswitched_\@", "", X86_FEATURE_PTI

+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lswitched_\@
+
	testl	$PTI_SWITCH_MASK, \cr3_reg
	jz	.Lswitched_\@

diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b..8c92ae2 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -59,6 +59,7 @@ struct cpu_entry_area {
 #define CPU_ENTRY_AREA_TOT_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)
 
 DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+DECLARE_PER_CPU(int, pti_enabled);
 
 extern void setup_cpu_entry_areas(void);
 extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f..da34a21 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,9 @@ void load_percpu_segment(int cpu)
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+DEFINE_PER_CPU(int, pti_enabled);
+
+
 #ifdef CONFIG_X86_64
 /*
  * Special IST stacks which the CPU switches to when it calls

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Pavel Machek
On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek  wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
> 
> That is more or less expected.
> 
> Gcc spends about 90+% of its time in user space, and the system calls
> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> no longer have the pipe between cpp and cc1, so they don't have that
> issue either (which would have shown the PTI slowdown a lot more)
> 
> Some other loads will do a lot more time traversing the user/kernel
> boundary, and in 32-bit mode you won't be able to take advantage of
> the address space ID's, so you really get the full effect.

Understood. Just -- bzip2 should include quite a lot of time in
userspace, too. 

> > Would it be possible to have per-process control of kpti? I have
> > some processes where trading off speed for security would make sense.
> 
> That was pretty extensively discussed, and no sane model for it was
> ever agreed upon.  Some people wanted it per-thread, others per-mm,
> and it wasn't clear how to set it either and how it should inherit
> across fork/exec, and what the namespace rules etc should be.
> 
> You absolutely need to inherit it (so that you can say "I trust this
> session" or whatever), but at the same time you *don't* want to
> inherit if you have a server you trust that then spawns user processes
> (think "I want systemd to not have the overhead, but the user
> processes it spawns obviously do need protection").
> 
> It was just a morass. Nothing came out of it.  I guess people can
> discuss it again, but it's not simple.

I agree it is not easy. OTOH -- 30% of user-visible performance is a
_lot_. That is worth spending man-years on...  Ok, problem is not as
severe on modern CPUs with address space ID's, but...

What I want is "if A can ptrace B, and B has pti disabled, A can have
pti disabled as well". Now... I see someone may want to have it
per-thread, because for stuff like a JavaScript JIT, a thread may have
the rights to call ptrace, but be unable to actually call it because the
JIT removed that ability... hmm...

But for now I'd like at least "global" option of turning pti on/off
during runtime for benchmarking. Let me see...

Something like this, or is it going to be way more complex? Does
anyone have patch by chance?

Pavel

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index dfb975b..719e39a 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -162,6 +162,9 @@
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
 	movl	%cr3, \scratch_reg
 	orl	$PTI_SWITCH_MASK, \scratch_reg
 	movl	\scratch_reg, %cr3
@@ -176,6 +179,8 @@
 	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
 	jz	.Lend_\@
 	.endif
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
 	/* On user-cr3? */
 	movl	%cr3, %eax
 	testl	$PTI_SWITCH_MASK, %eax
@@ -192,6 +197,10 @@
  */
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
 	movl	%cr3, \scratch_reg
 	/* Test if we are already on kernel CR3 */
 	testl	$PTI_SWITCH_MASK, \scratch_reg
@@ -302,6 +311,9 @@
  */
 	ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI
 
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lswitched_\@
+
 	testl	$PTI_SWITCH_MASK, \cr3_reg
 	jz	.Lswitched_\@
 
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b..8c92ae2 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -59,6 +59,7 @@ struct cpu_entry_area {
 #define CPU_ENTRY_AREA_TOT_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)
 
 DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+DECLARE_PER_CPU(int, pti_enabled);
 
 extern void setup_cpu_entry_areas(void);
 extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f..da34a21 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,9 @@ void load_percpu_segment(int cpu)
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+DEFINE_PER_CPU(int, pti_enabled);
+
+
 #ifdef CONFIG_X86_64
 /*
  * Special IST stacks which the CPU switches to when it calls
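[ For readers following along: the asm above only gates the existing CR3
switch on a per-CPU flag; the switch itself sets or clears a single address
bit in CR3. A rough Python model of that logic follows. PTI_SWITCH_MASK as
bit 12 is an assumption borrowed from the x86-64 layout (user PGD one page
after the kernel PGD); the real x86-32 value may differ. ]

```python
# Hypothetical model of SWITCH_TO_USER_CR3 / SWITCH_TO_KERNEL_CR3 with the
# proposed per-CPU pti_enabled gate. Bit 12 as the switch mask is an
# assumption; it mirrors the x86-64 implementation.
PTI_SWITCH_MASK = 1 << 12

def switch_to_user_cr3(cr3, pti_enabled):
    if not pti_enabled:           # cmpl $1, PER_CPU_VAR(pti_enabled); jne .Lend
        return cr3
    return cr3 | PTI_SWITCH_MASK  # orl $PTI_SWITCH_MASK, \scratch_reg

def switch_to_kernel_cr3(cr3, pti_enabled):
    if not pti_enabled:
        return cr3
    return cr3 & ~PTI_SWITCH_MASK

kernel_cr3 = 0x34000000
user_cr3 = switch_to_user_cr3(kernel_cr3, True)
assert user_cr3 & PTI_SWITCH_MASK                     # now on the user copy
assert switch_to_kernel_cr3(user_cr3, True) == kernel_cr3
# With the gate off, both macros fall through and CR3 is untouched:
assert switch_to_user_cr3(kernel_cr3, False) == kernel_cr3
```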

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Linus Torvalds
On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek  wrote:
>
> Meanwhile... it looks like gcc is not slowed down significantly, but
> other stuff sees 30% .. 40% slowdowns... which is rather
> significant.

That is more or less expected.

Gcc spends about 90+% of its time in user space, and the system calls
it *does* do tend to be "real work" (open/read/etc). And modern gcc's
no longer have the pipe between cpp and cc1, so they don't have that
issue either (which would have shown the PTI slowdown a lot more)

Some other loads will do a lot more time traversing the user/kernel
boundary, and in 32-bit mode you won't be able to take advantage of
the address space ID's, so you really get the full effect.
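[ A quick way to see this effect for yourself: compare a pure user-space
loop against a loop that enters the kernel on every iteration. Each kernel
crossing carries a fixed cost that PTI increases (CR3 writes, and TLB
flushes on CPUs without address space IDs), so only the second loop pays
it. This is a rough illustrative microbenchmark, not the workloads
discussed above. ]

```python
# Contrast N iterations of pure user-space work with N one-byte reads from
# /dev/zero (one kernel entry/exit each). The absolute numbers are machine-
# dependent; the point is that only the second loop crosses the boundary.
import os
import time

N = 100_000

t0 = time.perf_counter()
s = 0
for i in range(N):
    s += i * i                 # pure user-space work, no kernel entry
user_only = time.perf_counter() - t0

fd = os.open("/dev/zero", os.O_RDONLY)
t0 = time.perf_counter()
for _ in range(N):
    os.read(fd, 1)             # one syscall per iteration
syscall_heavy = time.perf_counter() - t0
os.close(fd)

print(f"user-only loop: {user_only:.3f}s, syscall loop: {syscall_heavy:.3f}s")
```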

> Would it be possible to have per-process control of kpti? I have
> some processes where trading off speed for security would make sense.

That was pretty extensively discussed, and no sane model for it was
ever agreed upon.  Some people wanted it per-thread, others per-mm,
and it wasn't clear how to set it either and how it should inherit
across fork/exec, and what the namespace rules etc should be.

You absolutely need to inherit it (so that you can say "I trust this
session" or whatever), but at the same time you *don't* want to
inherit if you have a server you trust that then spawns user processes
(think "I want systemd to not have the overhead, but the user
processes it spawns obviously do need protection").

It was just a morass. Nothing came out of it.  I guess people can
discuss it again, but it's not simple.

   Linus


Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Pavel Machek
Hi!

> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:

It seems PTI is now in -next. I'll test that soon.

Meanwhile... it looks like gcc is not slowed down significantly, but
other stuff sees 30% .. 40% slowdowns... which is rather
significant.

Would it be possible to have per-process control of kpti? I have
some processes where trading off speed for security would make sense.

Best regards,
Pavel

cd ~/g/tui/nowcast
time ./nowcast -x (30%)
KPTI: 139.25user 73.65system 269.90 (4m29.901s) elapsed 78.88%CPU
  133.35user 73.15system 228.80 (3m48.802s) elapsed 90.25%CPU
  140.51user 74.21system 218.33 (3m38.338s) elapsed 98.34%CPU
  133.85user 75.89system 212.02 (3m32.026s) elapsed 98.93%CPU (no chromium)
  139.34user 75.00system 235.75 (3m55.752s) elapsed 90.92%CPU
  
4.18: 116.99user 43.79system 217.65 (3m37.653s) elapsed 73.87%CPU
  115.14user 43.97system 178.85 (2m58.855s) elapsed 88.96%CPU
  128.47user 47.22system 178.24 (2m58.245s) elapsed 98.57%CPU
  132.30user 49.27system 184.40 (3m4.408s) elapsed 98.46%CPU
  134.88user 48.59system 186.67 (3m6.673s) elapsed 98.29%CPU
  132.15user 48.65system 524.68 (8m44.684s) elapsed 34.46%CPU
  120.38user 45.45system 168.72 (2m48.720s) elapsed 98.29%CPU
  
time cat /dev/urandom | head -c 1000 |  bzip2 -9 - | wc -c (40%)
v4.18: 4.57user 0.23system 4.64 (0m4.644s) elapsed 103.53%CPU
   4.86user 0.23system 4.95 (0m4.952s) elapsed 102.81%CPU
   5.13user 0.22system 5.19 (0m5.190s) elapsed 103.14%CPU
KPTI:  6.39user 0.48system 6.74 (0m6.747s) elapsed 101.96%CPU
   6.66user 0.41system 6.91 (0m6.912s) elapsed 102.51%CPU
   6.53user 0.51system 6.91 (0m6.919s) elapsed 101.99%CPU

v4l-utils: make clean, time make
v4.18: 191.93user 11.00system 211.19 (3m31.191s) elapsed 96.09%CPU
   221.21user 14.69system 248.73 (4m8.734s) elapsed 94.84%CPU
   198.35user 11.61system 211.39 (3m31.392s) elapsed 99.32%CPU
   204.87user 11.69system 217.97 (3m37.971s) elapsed 99.35%CPU
   203.68user 11.88system 217.29 (3m37.291s) elapsed 99.20%CPU
KPTI:  156.45user 40.08system 204.77 (3m24.777s) elapsed 95.97%CPU
   183.32user 38.64system 225.03 (3m45.031s) elapsed 98.63%CPU
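[ Sanity-checking the "40%" bzip2 figure from the elapsed wall-clock times
(third column) of the three runs per kernel quoted above: ]

```python
# Back-of-envelope slowdown from the bzip2 elapsed times in the tables above.
v418 = [4.644, 4.952, 5.190]   # v4.18 baseline, seconds elapsed
kpti = [6.747, 6.912, 6.919]   # same workload with KPTI enabled

def mean(xs):
    return sum(xs) / len(xs)

slowdown_pct = (mean(kpti) / mean(v418) - 1.0) * 100.0
print(f"mean KPTI slowdown: {slowdown_pct:.1f}%")  # about 39%, i.e. roughly the quoted 40%
assert 35.0 < slowdown_pct < 45.0
```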

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-23 Thread Joerg Roedel
Hey David,

On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> driver. (This is the same VM definition I shared with you in a PM
> back on Feb. 20th, except note that 4.18 kernels won't successfully
> boot with QEMU's IDE device, so I'm using SATA instead. That's a
> regression totally unrelated to your change sets, or to the general
> booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

Yes, this needs the fixes in the tip/x86/mm branch as well. Can you pull
that branch in and test again, please?


Thanks,

Joerg



Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-22 Thread David H. Gutteridge
On Fri, 2018-07-20 at 18:22 +0200, Joerg Roedel wrote:
> Hi,
> 
> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:
> 
>   Patch 1: Very important, it fixes a vmalloc-fault in NMI context
>            when PTI is enabled. This is pretty unlikely to hit
>            when starting perf on an idle machine, which is why I
>            didn't find it earlier in my testing. I always started
>            'perf top' first :/ But when I start 'perf top' last
>            when the kernel-compile already runs, it hits almost
>            immediately.
> 
>   Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
>            to also take VM86 into account. This is not strictly
>            necessary because the slow-path also works for VM86
>            mode but it is not how the code was intended to work.
>            And it breaks when Patch 3 is applied on-top.
> 
>   Patch 3: Implement the reduced copying in the paranoid
>            entry/exit path as suggested by Andy Lutomirski while
>            reviewing version 7 of the original patches.
> 
> I have the x86/tip branch with these patches on-top running my test for
> 6h now, with no issues so far. So for now it looks like there are no
> scheduling points or irq-enabled sections reached from the paranoid
> entry/exit paths and we always return to the entry-stack we came from.
> 
> I keep the test running over the weekend at least.
> 
> Please review.
> 
> [ If Patch 1 looks good to the maintainers I suggest applying it soon,
>   before too many linux-next testers run into this issue. It is actually
>   the reason why I send out the patches _now_ and didn't wait until next
>   week when the other two patches got more testing from my side. ]
> 
> Thanks,
> 
>   Joerg
> 
> Joerg Roedel (3):
>   perf/core: Make sure the ring-buffer is mapped in all page-tables
>   x86/entry/32: Check for VM86 mode in slow-path check
>   x86/entry/32: Copy only ptregs on paranoid entry/exit path
> 
>  arch/x86/entry/entry_32.S   | 82 ++---
>  kernel/events/ring_buffer.c | 10 ++
>  2 files changed, 58 insertions(+), 34 deletions(-)

Hi Joerg,

I tested again using the tip "x86/pti" branch (with two of the three
patches in your change set already applied), and manually applied
your third patch on top of it (I see there was some debate about it,
but I thought I'd include it), plus I had to manually apply the patch
to fix booting (d1b47a7c9efcf3c3384b70f6e3c8f1423b44d8c7: "mm: don't
do zero_resv_unavail if memmap is not allocated"), since "x86/pti"
doesn't include it yet.

Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
driver. (This is the same VM definition I shared with you in a PM
back on Feb. 20th, except note that 4.18 kernels won't successfully
boot with QEMU's IDE device, so I'm using SATA instead. That's a
regression totally unrelated to your change sets, or to the general
booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

[drm] Found bochs VGA, ID 0xb0c0.
[drm] Framebuffer size 16384 kB @ 0xfd00, mmio @ 0xfebd4000.
[TTM] Zone  kernel: Available graphics memory: 390536 kiB
[TTM] Zone highmem: Available graphics memory: 4659530 kiB
[TTM] Initializing pool allocator
[TTM] Initializing DMA pool allocator
[ cut here ]
kernel BUG at arch/x86/mm/fault.c:269!
invalid opcode:  [#1] SMP PTI
CPU: 0 PID: 349 Comm: systemd-udevd Tainted: G W 4.18.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1
04/01/2014
EIP: vmalloc_fault+0x1d7/0x200
Code: 00 f0 1f 00 81 ea 00 00 20 00 21 d0 8b 55 e8 89 c6 81 e2 ff 0f 00
00 0f ac d6 0c 8d 04 b6 c1 e0 03 39 45 ec 0f 84 37 ff ff ff <0f> 0b 8d
b4 26 00 00 00 00 83 c4 0c b8 ff ff ff ff 5b 5e 5f 5d c3 
EAX: 02788000 EBX: c85b6de8 ECX: 0080 EDX: 
ESI: 000fd000 EDI: fdf3 EBP: f4743994 ESP: f474397c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010083
CR0: 80050033 CR2: f7a0 CR3: 347f6000 CR4: 06f0
Call Trace:
 __do_page_fault+0x340/0x4c0
 do_page_fault+0x25/0xf0
 ? ttm_mem_reg_ioremap+0xe5/0x100 [ttm]
 ? kvm_async_pf_task_wait+0x1b0/0x1b0
 do_async_page_fault+0x55/0x80
 common_exception+0x13f/0x146
EIP: memset+0xb/0x20
Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b
5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0  aa 89
d8 5b 5f 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 90 66 
EAX:  EBX: f7a0 ECX: 0030 EDX: 
ESI: f4743b9c EDI: f7a0 EBP: f4743a4c ESP: f4743a44
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206
 ttm_bo_move_memcpy+0x49a/0x4c0 [ttm]
 ? _cond_resched+0x17/0x40
 ttm_bo_handle_move_mem+0x554/0x570 [ttm]
 ? ttm_bo_mem_space+0x211/0x440 [ttm]
 ttm_bo_validate+0xf5/0x110 [ttm]
 bochs_bo_pin+0xde/0x1c0 [bochs_drm]
 

[PATCH 0/3] PTI for x86-32 Fixes and Updates

2018-07-20 Thread Joerg Roedel
Hi,

here are 3 patches which update the PTI-x86-32 patches recently merged
into the tip-tree. The patches are ordered by importance:

Patch 1: Very important, it fixes a vmalloc-fault in NMI context
         when PTI is enabled. This is pretty unlikely to hit
         when starting perf on an idle machine, which is why I
         didn't find it earlier in my testing. I always started
         'perf top' first :/ But when I start 'perf top' last
         when the kernel-compile already runs, it hits almost
         immediately.

Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
         to also take VM86 into account. This is not strictly
         necessary because the slow-path also works for VM86
         mode but it is not how the code was intended to work.
         And it breaks when Patch 3 is applied on-top.

Patch 3: Implement the reduced copying in the paranoid
         entry/exit path as suggested by Andy Lutomirski while
         reviewing version 7 of the original patches.

I have the x86/tip branch with these patches on-top running my test for
6h now, with no issues so far. So for now it looks like there are no
scheduling points or irq-enabled sections reached from the paranoid
entry/exit paths and we always return to the entry-stack we came from.

I keep the test running over the weekend at least.

Please review.

[ If Patch 1 looks good to the maintainers I suggest applying it soon,
  before too many linux-next testers run into this issue. It is actually
  the reason why I send out the patches _now_ and didn't wait until next
  week when the other two patches got more testing from my side. ]

Thanks,

Joerg

Joerg Roedel (3):
  perf/core: Make sure the ring-buffer is mapped in all page-tables
  x86/entry/32: Check for VM86 mode in slow-path check
  x86/entry/32: Copy only ptregs on paranoid entry/exit path

 arch/x86/entry/entry_32.S   | 82 ++---
 kernel/events/ring_buffer.c | 10 ++
 2 files changed, 58 insertions(+), 34 deletions(-)

-- 
2.7.4


