Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Mon, 2018-07-23 at 09:29 +0200, Joerg Roedel wrote:
> Hey David,
>
> On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> > Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> > driver. (This is the same VM definition I shared with you in a PM
> > back on Feb. 20th, except note that 4.18 kernels won't successfully
> > boot with QEMU's IDE device, so I'm using SATA instead. That's a
> > regression totally unrelated to your change sets, or to the general
> > booting issue with 4.18 RC5, since it occurs in vanilla RC4 as
> > well.)
>
> Yes, this needs the fixes in the tip/x86/mm branch as well. Can you
> pull that branch in and test again, please?

Sorry, I didn't realize I needed those changes, too. I've re-tested with
those applied and haven't encountered any issues.

I'm now re-testing again with your newer patch set from the 25th. No
issues so far with those, either; I'll confirm in that email thread after
the laptop has seen some more use.

Dave
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Mon 2018-07-23 14:50:50, Andy Lutomirski wrote:
>
> > On Jul 23, 2018, at 2:38 PM, Pavel Machek wrote:
> >
> >> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> >>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek wrote:
> >>>
> >>> Meanwhile... it looks like gcc is not slowed down significantly, but
> >>> other stuff sees 30% .. 40% slowdowns... which is rather
> >>> significant.
> >>
> >> That is more or less expected.
> >>
> >> Gcc spends about 90+% of its time in user space, and the system calls
> >> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> >> no longer have the pipe between cpp and cc1, so they don't have that
> >> issue either (which would have shown the PTI slowdown a lot more).
> >>
> >> Some other loads will do a lot more time traversing the user/kernel
> >> boundary, and in 32-bit mode you won't be able to take advantage of
> >> the address space IDs, so you really get the full effect.
> >
> > Understood. Just -- bzip2 should include quite a lot of time in
> > userspace, too.
> >
> >>> Would it be possible to have per-process control of kpti? I have
> >>> some processes where trading off speed for security would make sense.
> >>
> >> That was pretty extensively discussed, and no sane model for it was
> >> ever agreed upon. Some people wanted it per-thread, others per-mm,
> >> and it wasn't clear how to set it either and how it should inherit
> >> across fork/exec, and what the namespace rules etc should be.
> >>
> >> You absolutely need to inherit it (so that you can say "I trust this
> >> session" or whatever), but at the same time you *don't* want to
> >> inherit if you have a server you trust that then spawns user processes
> >> (think "I want systemd to not have the overhead, but the user
> >> processes it spawns obviously do need protection").
> >>
> >> It was just a morass. Nothing came out of it. I guess people can
> >> discuss it again, but it's not simple.
> >
> > I agree it is not easy. OTOH -- 30% of user-visible performance is a
> > _lot_. That is worth spending man-years on... Ok, the problem is not as
> > severe on modern CPUs with address space IDs, but...
> >
> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now... I see someone may want to have it
> > per-thread, because for stuff like a JavaScript JIT, a thread may have
> > rights to call ptrace, but is unable to call ptrace because the JIT
> > removed that ability... hmm...
>
> No, you don’t want that. The problem is that Meltdown isn’t a problem
> that exists in isolation. It’s very plausible that JavaScript code could
> trigger a speculation attack that, with PTI off, could read kernel
> memory.

Ok, you are right. It is more tricky than I thought.

Still, I probably don't need to run greps and cats with PTI on. Chromium
(etc.) probably needs it. A Python interpreter running my own code
probably does not.

Yes, my Thinkpad X60 is probably thermal-throttled. It is not really a
new machine. I switched to a T40p for now :-).

What is the "worst" case people are seeing?

time dd if=/dev/zero of=/dev/null bs=1 count=1000

can reproduce a 3x slowdown, but that's basically a microbenchmark.

A root-only per-process enable/disable of kpti should not be too hard to
do. Would there be interest in that?

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
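As a sketch only: the dd invocation above is syscall-bound by construction, which is why it exaggerates the PTI cost so much. With bs=1, every byte copied costs one read() and one write(), so nearly all time is spent crossing the user/kernel boundary where the CR3 switches are paid (the count value here is illustrative):

```shell
# Syscall-heavy microbenchmark: bs=1 forces one read() plus one write()
# per byte, so per-syscall overhead (PTI's CR3 switches included)
# dominates. Compare elapsed time on kernels booted with PTI on vs off.
time dd if=/dev/zero of=/dev/null bs=1 count=100000
```

A larger block size (e.g. bs=64k) amortizes the syscall cost away, which is why ordinary workloads see far smaller slowdowns than this loop suggests.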
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
> On Jul 24, 2018, at 6:39 AM, Pavel Machek wrote:
>
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek wrote:
>>>
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>>
>> That is more or less expected.
>
> Ok, so I was wrong. bzip2 showed a 30% slowdown, but running the test in
> a loop, I get that (on v4.18), too.
>
> ...

The obvious cause would be thermal issues, which are increasingly common
in laptops. You could get cycle counts from perf stat, perhaps.
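Andy's throttling hypothesis can be sanity-checked even without perf. A minimal sketch (assuming the standard cpufreq sysfs interface is present on the machine): read the current core clock while the benchmark runs and compare it against the hardware maximum.

```shell
# Quick thermal-throttling check: if scaling_cur_freq sits well below
# cpuinfo_max_freq during a benchmark run, the slowdown is the clock,
# not PTI. (perf stat -e cycles answers the same question more precisely.)
c=/sys/devices/system/cpu/cpu0/cpufreq
if [ -r "$c/scaling_cur_freq" ]; then
    echo "cur: $(cat "$c/scaling_cur_freq") kHz, max: $(cat "$c/cpuinfo_max_freq") kHz"
else
    echo "no cpufreq sysfs on this machine"
fi
```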
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
>
> That is more or less expected.

Ok, so I was wrong. bzip2 showed a 30% slowdown, but running the test in
a loop, I get that (on v4.18), too. That tells me that something is wrong
with the machine I'm using for benchmarking. Whether KPTI is enabled can
still be measured with the bzip2 pipeline, but the effect is far more
subtle.

Pavel

pavel@amd:~$ while true; do time cat /dev/urandom | head -c 1000 | bzip2 -9 - | wc -c ; done
10044031 3.87user 0.91system 4.62 (0m4.622s) elapsed 103.48%CPU
10044234 4.03user 0.82system 4.68 (0m4.688s) elapsed 103.67%CPU
10043664 4.28user 0.85system 4.99 (0m4.994s) elapsed 102.90%CPU
10045959 4.43user 0.85system 5.12 (0m5.121s) elapsed 103.44%CPU
10043829 4.50user 0.89system 5.22 (0m5.228s) elapsed 103.22%CPU
10044296 4.65user 0.93system 5.39 (0m5.398s) elapsed 103.61%CPU
10045311 4.76user 0.93system 5.47 (0m5.479s) elapsed 103.98%CPU
10043819 4.81user 0.93system 5.55 (0m5.556s) elapsed 103.37%CPU
10045097 4.72user 1.04system 5.59 (0m5.597s) elapsed 103.01%CPU
10044012 4.86user 0.97system 5.68 (0m5.684s) elapsed 102.79%CPU
10044569 4.93user 0.96system 5.72 (0m5.728s) elapsed 102.92%CPU
10044141 4.94user 0.98system 5.75 (0m5.752s) elapsed 102.97%CPU
10043695 4.97user 0.95system 5.76 (0m5.768s) elapsed 102.87%CPU
10045690 5.12user 0.94system 5.90 (0m5.901s) elapsed 102.79%CPU
10045153 5.06user 1.00system 5.88 (0m5.883s) elapsed 103.21%CPU
10044560 5.10user 1.01system 5.92 (0m5.927s) elapsed 103.31%CPU
10044845 5.17user 0.99system 5.96 (0m5.960s) elapsed 103.44%CPU
10043884 5.15user 1.03system 6.00 (0m6.004s) elapsed 103.14%CPU
10044286 5.18user 1.01system 6.00 (0m6.002s) elapsed 103.40%CPU
10045749 5.00user 1.22system 6.04 (0m6.044s) elapsed 102.98%CPU
10044098 5.22user 1.02system 6.05 (0m6.053s) elapsed 103.21%CPU
10045326 5.20user 1.01system 6.04 (0m6.048s) elapsed 102.72%CPU
10042365 5.22user 1.03system 6.06 (0m6.061s) elapsed 103.30%CPU
10043952 5.24user 1.00system 6.06 (0m6.069s) elapsed 102.97%CPU
10044569 5.30user 1.00system 6.09 (0m6.099s) elapsed 103.46%CPU
10043241 5.26user 1.00system 6.09 (0m6.097s) elapsed 102.79%CPU
10044797 5.30user 1.01system 6.11 (0m6.114s) elapsed 103.46%CPU
10043711 5.25user 1.02system 6.09 (0m6.093s) elapsed 103.03%CPU
10043882 5.31user 1.01system 6.13 (0m6.131s) elapsed 103.28%CPU
10043571 5.26user 1.05system 6.13 (0m6.133s) elapsed 103.06%CPU
10044742 5.29user 1.03system 6.12 (0m6.122s) elapsed 103.25%CPU
10044170 5.35user 1.04system 6.18 (0m6.183s) elapsed 103.60%CPU
10043542 5.22user 1.12system 6.17 (0m6.172s) elapsed 102.89%CPU
10042985 5.25user 1.13system 6.19 (0m6.193s) elapsed 103.09%CPU
10044102 5.36user 1.01system 6.17 (0m6.177s) elapsed 103.17%CPU
10044609 5.48user 0.99system 6.28 (0m6.284s) elapsed 103.11%CPU
10045185 5.40user 1.03system 6.23 (0m6.236s) elapsed 103.29%CPU
1004 5.41user 1.06system 6.25 (0m6.255s) elapsed 103.55%CPU
10044859 5.35user 1.04system 6.20 (0m6.201s) elapsed 103.17%CPU
10045613

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
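A more repeatable variant of the loop above, as a sketch (it assumes bzip2 is installed): generate the random input once, so run-to-run variance comes from the machine rather than from compressing different bytes each iteration.

```shell
# Fixed-input version of Pavel's benchmark loop: identical data every
# iteration, a handful of runs, then clean up the temp file.
in=$(mktemp)
head -c 1000000 /dev/urandom > "$in"
for i in 1 2 3; do
    time bzip2 -9 -c "$in" | wc -c
done
rm -f "$in"
```

If the elapsed times still drift upward across iterations with fixed input, the drift is the machine (e.g. thermal throttling), not the workload.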
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On 07/23/2018 02:59 PM, Josh Poimboeuf wrote:
> On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
>> But for now I'd like at least a "global" option of turning pti on/off
>> during runtime for benchmarking. Let me see...
>>
>> Something like this, or is it going to be way more complex? Does
>> anyone have a patch, by chance?
> RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().

Let's not forget PTI's NX-for-userspace in the kernel page tables. That
provides Spectre v2 mitigation as well as Meltdown mitigation.
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
> But for now I'd like at least a "global" option of turning pti on/off
> during runtime for benchmarking. Let me see...
>
> Something like this, or is it going to be way more complex? Does
> anyone have a patch, by chance?

RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().

-- 
Josh
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
Hi!

> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now... I see someone may want to have it
> > per-thread, because for stuff like a JavaScript JIT, a thread may have
> > rights to call ptrace, but is unable to call ptrace because the JIT
> > removed that ability... hmm...
>
> No, you don’t want that. The problem is that Meltdown isn’t a problem
> that exists in isolation. It’s very plausible that JavaScript code could
> trigger a speculation attack that, with PTI off, could read kernel
> memory.

Yeah, the web browser threads that run JavaScript code should have PTI
on. But maybe I want the rest of the web browser with PTI off. So... yes,
I see why someone may want it per-thread (and not per-process).

I guess per-process would be good enough for me. Actually, maybe even
per-uid. I don't have any fancy security here, so anything running as uid
0 or 1000 is close enough to trusted.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
> On Jul 23, 2018, at 2:38 PM, Pavel Machek wrote:
>
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek wrote:
>>>
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>>
>> That is more or less expected.
>>
>> Gcc spends about 90+% of its time in user space, and the system calls
>> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
>> no longer have the pipe between cpp and cc1, so they don't have that
>> issue either (which would have shown the PTI slowdown a lot more).
>>
>> Some other loads will do a lot more time traversing the user/kernel
>> boundary, and in 32-bit mode you won't be able to take advantage of
>> the address space IDs, so you really get the full effect.
>
> Understood. Just -- bzip2 should include quite a lot of time in
> userspace, too.
>
>>> Would it be possible to have per-process control of kpti? I have
>>> some processes where trading off speed for security would make sense.
>>
>> That was pretty extensively discussed, and no sane model for it was
>> ever agreed upon. Some people wanted it per-thread, others per-mm,
>> and it wasn't clear how to set it either and how it should inherit
>> across fork/exec, and what the namespace rules etc should be.
>>
>> You absolutely need to inherit it (so that you can say "I trust this
>> session" or whatever), but at the same time you *don't* want to
>> inherit if you have a server you trust that then spawns user processes
>> (think "I want systemd to not have the overhead, but the user
>> processes it spawns obviously do need protection").
>>
>> It was just a morass. Nothing came out of it. I guess people can
>> discuss it again, but it's not simple.
>
> I agree it is not easy. OTOH -- 30% of user-visible performance is a
> _lot_. That is worth spending man-years on... Ok, the problem is not as
> severe on modern CPUs with address space IDs, but...
>
> What I want is "if A can ptrace B, and B has pti disabled, A can have
> pti disabled as well". Now... I see someone may want to have it
> per-thread, because for stuff like a JavaScript JIT, a thread may have
> rights to call ptrace, but is unable to call ptrace because the JIT
> removed that ability... hmm...

No, you don’t want that. The problem is that Meltdown isn’t a problem
that exists in isolation. It’s very plausible that JavaScript code could
trigger a speculation attack that, with PTI off, could read kernel
memory.
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
>
> That is more or less expected.
>
> Gcc spends about 90+% of its time in user space, and the system calls
> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> no longer have the pipe between cpp and cc1, so they don't have that
> issue either (which would have shown the PTI slowdown a lot more).
>
> Some other loads will do a lot more time traversing the user/kernel
> boundary, and in 32-bit mode you won't be able to take advantage of
> the address space IDs, so you really get the full effect.

Understood. Just -- bzip2 should include quite a lot of time in
userspace, too.

> > Would it be possible to have per-process control of kpti? I have
> > some processes where trading off speed for security would make sense.
>
> That was pretty extensively discussed, and no sane model for it was
> ever agreed upon. Some people wanted it per-thread, others per-mm,
> and it wasn't clear how to set it either and how it should inherit
> across fork/exec, and what the namespace rules etc should be.
>
> You absolutely need to inherit it (so that you can say "I trust this
> session" or whatever), but at the same time you *don't* want to
> inherit if you have a server you trust that then spawns user processes
> (think "I want systemd to not have the overhead, but the user
> processes it spawns obviously do need protection").
>
> It was just a morass. Nothing came out of it. I guess people can
> discuss it again, but it's not simple.

I agree it is not easy. OTOH -- 30% of user-visible performance is a
_lot_. That is worth spending man-years on... Ok, the problem is not as
severe on modern CPUs with address space IDs, but...
What I want is "if A can ptrace B, and B has pti disabled, A can have
pti disabled as well". Now... I see someone may want to have it
per-thread, because for stuff like a JavaScript JIT, a thread may have
rights to call ptrace, but is unable to call ptrace because the JIT
removed that ability... hmm...

But for now I'd like at least a "global" option of turning pti on/off
during runtime for benchmarking. Let me see...

Something like this, or is it going to be way more complex? Does anyone
have a patch, by chance?

Pavel

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index dfb975b..719e39a 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -162,6 +162,9 @@
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
 	movl	%cr3, \scratch_reg
 	orl	$PTI_SWITCH_MASK, \scratch_reg
 	movl	\scratch_reg, %cr3
@@ -176,6 +179,8 @@
 	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
 	jz	.Lend_\@
 	.endif
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
 	/* On user-cr3? */
 	movl	%cr3, %eax
 	testl	$PTI_SWITCH_MASK, %eax
@@ -192,6 +197,10 @@
  */
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
 	movl	%cr3, \scratch_reg
 	/* Test if we are already on kernel CR3 */
 	testl	$PTI_SWITCH_MASK, \scratch_reg
@@ -302,6 +311,9 @@
  */
 	ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lswitched_\@
+
 	testl	$PTI_SWITCH_MASK, \cr3_reg
 	jz	.Lswitched_\@
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b..8c92ae2 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -59,6 +59,7 @@ struct cpu_entry_area {
 #define CPU_ENTRY_AREA_TOT_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)
 DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+DECLARE_PER_CPU(int, pti_enabled);
 extern void setup_cpu_entry_areas(void);
 extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f..da34a21 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,9 @@ void load_percpu_segment(int cpu)
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
+DEFINE_PER_CPU(int, pti_enabled);
+
+
 #ifdef CONFIG_X86_64
 /*
  * Special IST stacks which the CPU switches to when it calls

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
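For the purely global case, note that mainline already ships a boot-time switch: passing pti=off (or nopti) on the kernel command line disables PTI entirely. Whether PTI is active on a running kernel can be read back from sysfs, which is handy for confirming what a given benchmark run actually measured:

```shell
# Check whether the running kernel has PTI active. On kernels with the
# Meltdown mitigation this typically prints "Mitigation: PTI", or
# "Vulnerable" when booted with pti=off/nopti or on unaffected CPUs.
f=/sys/devices/system/cpu/vulnerabilities/meltdown
if [ -r "$f" ]; then
    cat "$f"
else
    echo "no vulnerabilities sysfs (kernel older than 4.15?)"
fi
```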
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Mon 2018-07-23 12:00:08, Linus Torvalds wrote: > On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek wrote: > > > > Meanwhile... it looks like gcc is not slowed down significantly, but > > other stuff sees 30% .. 40% slowdowns... which is rather > > significant. > > That is more or less expected. > > Gcc spends about 90+% of its time in user space, and the system calls > it *does* do tend to be "real work" (open/read/etc). And modern gcc's > no longer have the pipe between cpp and cc1, so they don't have that > issue either (which would have sjhown the PTI slowdown a lot more) > > Some other loads will do a lot more time traversing the user/kernel > boundary, and in 32-bit mode you won't be able to take advantage of > the address space ID's, so you really get the full effect. Understood. Just -- bzip2 should include quite a lot of time in userspace, too. > > Would it be possible to have per-process control of kpti? I have > > some processes where trading of speed for security would make sense. > > That was pretty extensively discussed, and no sane model for it was > ever agreed upon. Some people wanted it per-thread, others per-mm, > and it wasn't clear how to set it either and how it should inherit > across fork/exec, and what the namespace rules etc should be. > > You absolutely need to inherit it (so that you can say "I trust this > session" or whatever), but at the same time you *don't* want to > inherit if you have a server you trust that then spawns user processes > (think "I want systemd to not have the overhead, but the user > processes it spawns obviously do need protection"). > > It was just a morass. Nothing came out of it. I guess people can > discuss it again, but it's not simple. I agree it is not easy. OTOH -- 30% of user-visible performance is a _lot_. That is worth spending man-years on... Ok, problem is not as severe on modern CPUs with address space ID's, but... 
What I want is "if A can ptrace B, and B has pti disabled, A can have pti disabled as well". Now... I see someone may want to have it per-thread, because for stuff like a javascript JIT, a thread may have the right to call ptrace, but is unable to call ptrace because the JIT removed that ability... hmm...

But for now I'd like at least a "global" option of turning pti on/off during runtime, for benchmarking. Let me see... Something like this, or is it going to be way more complex? Does anyone have a patch by chance?

Pavel

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index dfb975b..719e39a 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -162,6 +162,9 @@
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
 	movl	%cr3, \scratch_reg
 	orl	$PTI_SWITCH_MASK, \scratch_reg
 	movl	\scratch_reg, %cr3
@@ -176,6 +179,8 @@
 	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
 	jz	.Lend_\@
 .endif
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
 	/* On user-cr3? */
 	movl	%cr3, %eax
 	testl	$PTI_SWITCH_MASK, %eax
@@ -192,6 +197,10 @@
 */
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
 	movl	%cr3, \scratch_reg
 	/* Test if we are already on kernel CR3 */
 	testl	$PTI_SWITCH_MASK, \scratch_reg
@@ -302,6 +311,9 @@
 	*/
 	ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lswitched_\@
+
 	testl	$PTI_SWITCH_MASK, \cr3_reg
 	jz	.Lswitched_\@
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b..8c92ae2 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -59,6 +59,7 @@ struct cpu_entry_area {
 #define CPU_ENTRY_AREA_TOT_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)
 DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+DECLARE_PER_CPU(int, pti_enabled);
 extern void setup_cpu_entry_areas(void);
 extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f..da34a21 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,9 @@ void load_percpu_segment(int cpu)
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
+DEFINE_PER_CPU(int, pti_enabled);
+
+
 #ifdef CONFIG_X86_64
 /*
 * Special IST stacks which the CPU switches to when it calls

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
signature.asc
Description: Digital signature
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek wrote:
>
> Meanwhile... it looks like gcc is not slowed down significantly, but
> other stuff sees 30% .. 40% slowdowns... which is rather
> significant.

That is more or less expected.

Gcc spends about 90+% of its time in user space, and the system calls it *does* do tend to be "real work" (open/read/etc). And modern gcc's no longer have the pipe between cpp and cc1, so they don't have that issue either (which would have shown the PTI slowdown a lot more).

Some other loads will spend a lot more time traversing the user/kernel boundary, and in 32-bit mode you won't be able to take advantage of the address space IDs, so you really get the full effect.

> Would it be possible to have per-process control of kpti? I have
> some processes where trading off speed for security would make sense.

That was pretty extensively discussed, and no sane model for it was ever agreed upon. Some people wanted it per-thread, others per-mm, and it wasn't clear how to set it either, how it should inherit across fork/exec, and what the namespace rules etc. should be.

You absolutely need to inherit it (so that you can say "I trust this session" or whatever), but at the same time you *don't* want to inherit it if you have a server you trust that then spawns user processes (think "I want systemd to not have the overhead, but the user processes it spawns obviously do need protection").

It was just a morass. Nothing came out of it. I guess people can discuss it again, but it's not simple.

Linus
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
Hi!

> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:

It seems PTI is now in -next. I'll test that soon.

Meanwhile... it looks like gcc is not slowed down significantly, but other stuff sees 30% .. 40% slowdowns... which is rather significant.

Would it be possible to have per-process control of kpti? I have some processes where trading off speed for security would make sense.

Best regards,
Pavel

cd ~/g/tui/nowcast
time ./nowcast -x (30%)

KPTI:
139.25user 73.65system 269.90 (4m29.901s) elapsed 78.88%CPU
133.35user 73.15system 228.80 (3m48.802s) elapsed 90.25%CPU
140.51user 74.21system 218.33 (3m38.338s) elapsed 98.34%CPU
133.85user 75.89system 212.02 (3m32.026s) elapsed 98.93%CPU (no chromium)
139.34user 75.00system 235.75 (3m55.752s) elapsed 90.92%CPU

4.18:
116.99user 43.79system 217.65 (3m37.653s) elapsed 73.87%CPU
115.14user 43.97system 178.85 (2m58.855s) elapsed 88.96%CPU
128.47user 47.22system 178.24 (2m58.245s) elapsed 98.57%CPU
132.30user 49.27system 184.40 (3m4.408s) elapsed 98.46%CPU
134.88user 48.59system 186.67 (3m6.673s) elapsed 98.29%CPU
132.15user 48.65system 524.68 (8m44.684s) elapsed 34.46%CPU
120.38user 45.45system 168.72 (2m48.720s) elapsed 98.29%CPU

time cat /dev/urandom | head -c 1000 | bzip2 -9 - | wc -c (40%)

v4.18:
4.57user 0.23system 4.64 (0m4.644s) elapsed 103.53%CPU
4.86user 0.23system 4.95 (0m4.952s) elapsed 102.81%CPU
5.13user 0.22system 5.19 (0m5.190s) elapsed 103.14%CPU

KPTI:
6.39user 0.48system 6.74 (0m6.747s) elapsed 101.96%CPU
6.66user 0.41system 6.91 (0m6.912s) elapsed 102.51%CPU
6.53user 0.51system 6.91 (0m6.919s) elapsed 101.99%CPU

v4l-utils: make clean, time make

v4.18:
191.93user 11.00system 211.19 (3m31.191s) elapsed 96.09%CPU
221.21user 14.69system 248.73 (4m8.734s) elapsed 94.84%CPU
198.35user 11.61system 211.39 (3m31.392s) elapsed 99.32%CPU
204.87user 11.69system 217.97 (3m37.971s) elapsed 99.35%CPU
203.68user 11.88system 217.29 (3m37.291s) elapsed 99.20%CPU

KPTI:
156.45user 40.08system 204.77 (3m24.777s) elapsed 95.97%CPU
183.32user 38.64system 225.03 (3m45.031s) elapsed 98.63%CPU

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
signature.asc
Description: Digital signature
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
Hey David,

On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> driver. (This is the same VM definition I shared with you in a PM
> back on Feb. 20th, except note that 4.18 kernels won't successfully
> boot with QEMU's IDE device, so I'm using SATA instead. That's a
> regression totally unrelated to your change sets, or to the general
> booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

Yes, this needs the fixes in the tip/x86/mm branch as well. Can you pull that branch in and test again, please?

Thanks,

Joerg
Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
On Fri, 2018-07-20 at 18:22 +0200, Joerg Roedel wrote:
> Hi,
>
> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:
>
> Patch 1: Very important, it fixes a vmalloc-fault in NMI context
>          when PTI is enabled. This is pretty unlikely to hit
>          when starting perf on an idle machine, which is why I
>          didn't find it earlier in my testing. I always started
>          'perf top' first :/ But when I start 'perf top' last,
>          when the kernel compile already runs, it hits almost
>          immediately.
>
> Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
>          to also take VM86 into account. This is not strictly
>          necessary because the slow-path also works for VM86
>          mode, but it is not how the code was intended to work.
>          And it breaks when Patch 3 is applied on top.
>
> Patch 3: Implement the reduced copying in the paranoid
>          entry/exit path as suggested by Andy Lutomirski while
>          reviewing version 7 of the original patches.
>
> I have the x86/tip branch with these patches on top running my test
> for 6h now, with no issues so far. So for now it looks like there are
> no scheduling points or irq-enabled sections reached from the paranoid
> entry/exit paths, and we always return to the entry-stack we came from.
>
> I keep the test running over the weekend at least.
>
> Please review.
>
> [ If Patch 1 looks good to the maintainers, I suggest applying it soon,
>   before too many linux-next testers run into this issue. It is
>   actually the reason why I send out the patches _now_ and didn't wait
>   until next week, when the other two patches got more testing from my
>   side. ]
>
> Thanks,
>
> Joerg
>
> Joerg Roedel (3):
>   perf/core: Make sure the ring-buffer is mapped in all page-tables
>   x86/entry/32: Check for VM86 mode in slow-path check
>   x86/entry/32: Copy only ptregs on paranoid entry/exit path
>
>  arch/x86/entry/entry_32.S   | 82 ++---
>  kernel/events/ring_buffer.c | 10 ++
>  2 files changed, 58 insertions(+), 34 deletions(-)

Hi Joerg,

I tested again using the tip "x86/pti" branch (with two of the three patches in your change set already applied), and manually applied your third patch on top of it (I see there was some debate about it, but I thought I'd include it). I also had to manually apply the patch to fix booting (d1b47a7c9efcf3c3384b70f6e3c8f1423b44d8c7: "mm: don't do zero_resv_unavail if memmap is not allocated"), since "x86/pti" doesn't include it yet.

Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA driver. (This is the same VM definition I shared with you in a PM back on Feb. 20th, except note that 4.18 kernels won't successfully boot with QEMU's IDE device, so I'm using SATA instead. That's a regression totally unrelated to your change sets, or to the general booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

[drm] Found bochs VGA, ID 0xb0c0.
[drm] Framebuffer size 16384 kB @ 0xfd00, mmio @ 0xfebd4000.
[TTM] Zone kernel: Available graphics memory: 390536 kiB
[TTM] Zone highmem: Available graphics memory: 4659530 kiB
[TTM] Initializing pool allocator
[TTM] Initializing DMA pool allocator
[ cut here ]
kernel BUG at arch/x86/mm/fault.c:269!
invalid opcode: [#1] SMP PTI
CPU: 0 PID: 349 Comm: systemd-udevd Tainted: G W 4.18.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
EIP: vmalloc_fault+0x1d7/0x200
Code: 00 f0 1f 00 81 ea 00 00 20 00 21 d0 8b 55 e8 89 c6 81 e2 ff 0f 00 00 0f ac d6 0c 8d 04 b6 c1 e0 03 39 45 ec 0f 84 37 ff ff ff <0f> 0b 8d b4 26 00 00 00 00 83 c4 0c b8 ff ff ff ff 5b 5e 5f 5d c3
EAX: 02788000 EBX: c85b6de8 ECX: 0080 EDX:
ESI: 000fd000 EDI: fdf3 EBP: f4743994 ESP: f474397c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010083
CR0: 80050033 CR2: f7a0 CR3: 347f6000 CR4: 06f0
Call Trace:
 __do_page_fault+0x340/0x4c0
 do_page_fault+0x25/0xf0
 ? ttm_mem_reg_ioremap+0xe5/0x100 [ttm]
 ? kvm_async_pf_task_wait+0x1b0/0x1b0
 do_async_page_fault+0x55/0x80
 common_exception+0x13f/0x146
EIP: memset+0xb/0x20
Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b 5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0 aa 89 d8 5b 5f 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 90 66
EAX: EBX: f7a0 ECX: 0030 EDX:
ESI: f4743b9c EDI: f7a0 EBP: f4743a4c ESP: f4743a44
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206
 ttm_bo_move_memcpy+0x49a/0x4c0 [ttm]
 ? _cond_resched+0x17/0x40
 ttm_bo_handle_move_mem+0x554/0x570 [ttm]
 ? ttm_bo_mem_space+0x211/0x440 [ttm]
 ttm_bo_validate+0xf5/0x110 [ttm]
 bochs_bo_pin+0xde/0x1c0 [bochs_drm]
[PATCH 0/3] PTI for x86-32 Fixes and Updates
Hi,

here are 3 patches which update the PTI-x86-32 patches recently merged into the tip-tree. The patches are ordered by importance:

Patch 1: Very important, it fixes a vmalloc-fault in NMI context
         when PTI is enabled. This is pretty unlikely to hit
         when starting perf on an idle machine, which is why I
         didn't find it earlier in my testing. I always started
         'perf top' first :/ But when I start 'perf top' last,
         when the kernel compile already runs, it hits almost
         immediately.

Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
         to also take VM86 into account. This is not strictly
         necessary because the slow-path also works for VM86
         mode, but it is not how the code was intended to work.
         And it breaks when Patch 3 is applied on top.

Patch 3: Implement the reduced copying in the paranoid
         entry/exit path as suggested by Andy Lutomirski while
         reviewing version 7 of the original patches.

I have the x86/tip branch with these patches on top running my test for 6h now, with no issues so far. So for now it looks like there are no scheduling points or irq-enabled sections reached from the paranoid entry/exit paths, and we always return to the entry-stack we came from.

I keep the test running over the weekend at least.

Please review.

[ If Patch 1 looks good to the maintainers, I suggest applying it soon,
  before too many linux-next testers run into this issue. It is actually
  the reason why I send out the patches _now_ and didn't wait until next
  week, when the other two patches got more testing from my side. ]

Thanks,

Joerg

Joerg Roedel (3):
  perf/core: Make sure the ring-buffer is mapped in all page-tables
  x86/entry/32: Check for VM86 mode in slow-path check
  x86/entry/32: Copy only ptregs on paranoid entry/exit path

 arch/x86/entry/entry_32.S   | 82 ++---
 kernel/events/ring_buffer.c | 10 ++
 2 files changed, 58 insertions(+), 34 deletions(-)

--
2.7.4