Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Pavel Machek wrote:

> On Mon 2017-09-25 09:33:42, Ingo Molnar wrote:
> >
> > * Pavel Machek wrote:
> >
> > > > For example, there would be collision with regular user-space mappings, right?
> > > > Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> > > > the kernel lives?
> > >
> > > Local unprivileged users can probably get your secret bits using cache probing
> > > and jump prediction buffers.
> > >
> > > Yes, you don't want to leak the information using mmap(MAP_FIXED), but CPU will
> > > leak it for you, anyway.
> >
> > Depends on the CPU I think, and CPU vendors are busy trying to mitigate this
> > angle.
>
> I believe any x86 CPU running Linux will leak it. And with CPU vendors
> putting "artificial intelligence" into branch prediction, no, I don't
> think it is going to get better.
>
> That does not mean we should not prevent the mmap() info leak, but...

That might or might not be so, but there's a world of difference between running a
relatively long statistical attack to figure out the kernel's location, versus
being able to programmatically probe the kernel's location using large MAP_FIXED
user-space mmap()s, within a few dozen microseconds or so and with a 100%
guaranteed, non-statistical result.

Thanks,

	Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Mon 2017-09-25 09:33:42, Ingo Molnar wrote:
>
> * Pavel Machek wrote:
>
> > > For example, there would be collision with regular user-space mappings, right?
> > > Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> > > the kernel lives?
> >
> > Local unprivileged users can probably get your secret bits using cache probing
> > and jump prediction buffers.
> >
> > Yes, you don't want to leak the information using mmap(MAP_FIXED), but CPU will
> > leak it for you, anyway.
>
> Depends on the CPU I think, and CPU vendors are busy trying to mitigate this
> angle.

I believe any x86 CPU running Linux will leak it. And with CPU vendors
putting "artificial intelligence" into branch prediction, no, I don't
think it is going to get better.

That does not mean we should not prevent the mmap() info leak, but...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Sat, Sep 23, 2017 at 2:43 AM, Ingo Molnar wrote:
>
> * Thomas Garnier wrote:
>
>> > 2) we first implement the additional entropy bits that Linus suggested.
>> >
>> > does this work for you?
>>
>> Sure, I can look at how feasible that is. If it is, can I send
>> everything as part of the same patch set? The additional entropy would
>> be enabled for all KASLR but PIE will be off-by-default of course.
>
> Sure, can all be part of the same series.

I looked deeper into the change Linus proposed (moving the .text section based on
the cacheline). I think the complexity is too high for the value of this change.

Moving only the .text section would require at least the following changes:

 - An overall change in how relocations are processed, to separate relocations
   inside and outside of the .text section.
 - Breaking assumptions on _text alignment while keeping size calculations
   accurate (for example _end - _text).

With a rough attempt at this, I managed to pass early boot but still crashed
later on.

This change would be valuable if you leak the address of a section other than
.text and want to know where .text is - meaning the main bug you are trying to
exploit only allows you to execute code (and you are trying to ROP into .text). I
would argue that a better mitigation for this type of bug is moving function
pointers to read-only sections and using stack cookies (for return addresses).

This change won't prevent other types of attacks, like data corruption. I think
it would be more valuable to look at something like selfrando / pagerando [1],
but maybe wait a bit for it to become more mature (especially on the debugging
side).

What do you think?

[1] http://lists.llvm.org/pipermail/llvm-dev/2017-June/113794.html

>
> Thanks,
>
> 	Ingo

-- 
Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Pavel Machek wrote:

> > For example, there would be collision with regular user-space mappings, right?
> > Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> > the kernel lives?
>
> Local unprivileged users can probably get your secret bits using cache probing
> and jump prediction buffers.
>
> Yes, you don't want to leak the information using mmap(MAP_FIXED), but CPU will
> leak it for you, anyway.

Depends on the CPU I think, and CPU vendors are busy trying to mitigate this
angle.

Thanks,

	Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
Hi!

> > We do need to consider how we want modules to fit into whatever model we
> > choose, though. They can be adjacent, or we could go with a more
> > traditional dynamic link model where the modules can be separate, and
> > chained together with the main kernel via the GOT.
>
> So I believe we should start with 'adjacent'. The thing is, having modules
> separately randomized mostly helps if any of the secret locations fails and
> we want to prevent hopping from one to the other. But if one of the
> kernel-privileged secret locations fails then KASLR has already failed to a
> significant degree...
>
> So I think the large-PIC model for modules does not buy us any real
> advantages in practice, and the disadvantages of large-PIC are real and most
> Linux users have to pay that cost unconditionally, as distro kernels have
> half of their kernel functionality living in modules.
>
> But I do see fundamental value in being able to hide the kernel somewhere in
> a ~48 bits address space, especially if we also implement Linus's suggestion
> to utilize the lower bits as well. 0..281474976710656 is a nicely large range
> and will get larger with time.
>
> But it should all be done smartly and carefully:
>
> For example, there would be collision with regular user-space mappings, right?
> Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> the kernel lives?

Local unprivileged users can probably get your secret bits using cache probing
and jump prediction buffers.

Yes, you don't want to leak the information using mmap(MAP_FIXED), but CPU will
leak it for you, anyway.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* H. Peter Anvin wrote:

> We do need to consider how we want modules to fit into whatever model we
> choose, though. They can be adjacent, or we could go with a more
> traditional dynamic link model where the modules can be separate, and
> chained together with the main kernel via the GOT.

So I believe we should start with 'adjacent'. The thing is, having modules
separately randomized mostly helps if any of the secret locations fails and we
want to prevent hopping from one to the other. But if one of the
kernel-privileged secret locations fails then KASLR has already failed to a
significant degree...

So I think the large-PIC model for modules does not buy us any real advantages
in practice, and the disadvantages of large-PIC are real and most Linux users
have to pay that cost unconditionally, as distro kernels have half of their
kernel functionality living in modules.

But I do see fundamental value in being able to hide the kernel somewhere in a
~48 bits address space, especially if we also implement Linus's suggestion to
utilize the lower bits as well. 0..281474976710656 is a nicely large range and
will get larger with time.

But it should all be done smartly and carefully:

For example, there would be collision with regular user-space mappings, right?
Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
the kernel lives?

Thanks,

	Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* H. Peter Anvin wrote:

> On 09/22/17 09:32, Ingo Molnar wrote:
> >
> > BTW., I think things improved with ORC because with ORC we have RBP as an
> > extra register and with PIE we lose RBX - so register pressure in code
> > generation is lower.
>
> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
> has RIP-relative addressing there is no need for a dedicated PIC register.

Indeed, but we'd use a new register _a lot_ for constructs like this,
transforming:

   mov r9, QWORD PTR [r11*8-0x7e3da060]		(8 bytes)

into:

   lea rbx, [rip+...]				(7 bytes)
   mov r9, QWORD PTR [rbx+r11*8]		(6 bytes)

... which I suppose is quite close to (but not the same as) 'losing' RBX.

Of course the compiler can pick other registers as well, not that it matters
much to register pressure in larger functions in the end. Plus if the compiler
has to pick a callee-saved register there's the additional saving/restoring
overhead of that as well. Right?

> I'm somewhat confused how we can have as much as almost 1% overhead. I
> suspect that we end up making a GOT and maybe even a PLT for no good reason.

So the above transformation alone would explain a good chunk of the overhead I
think.

Thanks,

	Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Thomas Garnier wrote:

> > 2) we first implement the additional entropy bits that Linus suggested.
> >
> > does this work for you?
>
> Sure, I can look at how feasible that is. If it is, can I send
> everything as part of the same patch set? The additional entropy would
> be enabled for all KASLR but PIE will be off-by-default of course.

Sure, can all be part of the same series.

Thanks,

	Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On September 23, 2017 3:06:16 AM GMT+08:00, "H. Peter Anvin" wrote:
> On 09/22/17 11:57, Kees Cook wrote:
> > On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin wrote:
> > > We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
> > > has RIP-relative addressing there is no need for a dedicated PIC register.
> >
> > FWIW, since gcc 5, the PIC register isn't totally lost. It is now
> > reusable, and that seems to have improved performance:
> > https://gcc.gnu.org/gcc-5/changes.html
>
> It still talks about a PIC register on x86-64, which confuses me.
> Perhaps older gcc's would allocate a PIC register under certain
> circumstances, and then lose it for the entire function?
>
> For i386, the PIC register is required by the ABI to be %ebx at the
> point any PLT entry is called. Not an issue with -mno-plt, which goes
> straight to the GOT, although in most cases there needs to be a PIC
> register to find the GOT unless load-time relocation is permitted.
>
> 	-hpa

We need a static PIE option so that the compiler can optimize it without using
hidden visibility.

H.J.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Sep 21, 2017 at 2:21 PM, Thomas Garnier wrote:
> On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel wrote:
>>
>> On 21 September 2017 at 08:59, Ingo Molnar wrote:
>> >
>> > ( Sorry about the delay in answering this. I could blame the delay on the
>> > merge window, but in reality I've been procrastinating this due to the
>> > permanent, non-trivial impact PIE has on generated C code. )
>> >
>> > * Thomas Garnier wrote:
>> >
>> >> 1) PIE sometimes needs two instructions to represent a single
>> >> instruction on mcmodel=kernel.
>> >
>> > What again is the typical frequency of this occurring in an x86-64
>> > defconfig kernel, with the very latest GCC?
>> >
>> > Also, to make sure: which unwinder did you use for your measurements,
>> > frame-pointers or ORC? Please use ORC only for future numbers, as
>> > frame-pointers is obsolete from a performance measurement POV.
>> >
>> >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
>> >
>> > Hopefully this can either be fixed in GCC or at least influenced via a
>> > compiler switch in the future.
>> >
>>
>> There are somewhat related concerns in the ARM world, so it would be
>> good if we could work with the GCC developers to get a more high level
>> and arch neutral command line option (-mkernel-pie? sounds yummy!)
>> that stops the compiler from making inferences that only hold for
>> shared libraries and/or other hosted executables (GOT indirections,
>> avoiding text relocations etc). That way, we will also be able to drop
>> the 'hidden' visibility override at some point, which we currently
>> need to prevent the compiler from redirecting all global symbol
>> references via entries in the GOT.
>
> My plan was to add a -mtls-reg= to switch the default segment
> register for stack cookies but I can see great benefits in having a
> more general kernel flag that would allow us to get rid of the GOT and
> PLT when building position independent code for the kernel. It
> could also include optimizations like folding switch tables etc...
>
> Should we start a separate discussion on that? Anyone that would be
> more experienced than I to push that to gcc & clang upstream?

After a separate discussion, opened:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

>
>> All we really need is the ability to move the image around in virtual
>> memory, and things like reducing the CoW footprint or enabling ELF
>> symbol preemption are completely irrelevant for us.

-- 
Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 09/22/17 11:57, Kees Cook wrote:
> On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin wrote:
>> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
>> has RIP-relative addressing there is no need for a dedicated PIC register.
>
> FWIW, since gcc 5, the PIC register isn't totally lost. It is now
> reusable, and that seems to have improved performance:
> https://gcc.gnu.org/gcc-5/changes.html

It still talks about a PIC register on x86-64, which confuses me.
Perhaps older gcc's would allocate a PIC register under certain
circumstances, and then lose it for the entire function?

For i386, the PIC register is required by the ABI to be %ebx at the
point any PLT entry is called. Not an issue with -mno-plt, which goes
straight to the GOT, although in most cases there needs to be a PIC
register to find the GOT unless load-time relocation is permitted.

	-hpa
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 09/22/17 09:32, Ingo Molnar wrote:
>
> BTW., I think things improved with ORC because with ORC we have RBP as an
> extra register and with PIE we lose RBX - so register pressure in code
> generation is lower.
>

We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
has RIP-relative addressing there is no need for a dedicated PIC register.

I'm somewhat confused how we can have as much as almost 1% overhead. I
suspect that we end up making a GOT and maybe even a PLT for no good reason.

	-hpa
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin wrote:
> On 09/22/17 09:32, Ingo Molnar wrote:
>>
>> BTW., I think things improved with ORC because with ORC we have RBP as an
>> extra register and with PIE we lose RBX - so register pressure in code
>> generation is lower.
>>
>
> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
> has RIP-relative addressing there is no need for a dedicated PIC register.
>
> I'm somewhat confused how we can have as much as almost 1% overhead. I
> suspect that we end up making a GOT and maybe even a PLT for no good reason.

We have a GOT with very few entries, mainly linker-script globals, that I think
we can work to reduce or remove. We have a PLT but it is empty. On the latest
iteration (not sent yet), modules have PLT32 relocations but no PLT entries. I
got rid of mcmodel=large for modules; instead I move the beginning of the module
section just after the kernel so relative relocations work.

>
> 	-hpa

-- 
Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin wrote:
> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
> has RIP-relative addressing there is no need for a dedicated PIC register.

FWIW, since gcc 5, the PIC register isn't totally lost. It is now
reusable, and that seems to have improved performance:
https://gcc.gnu.org/gcc-5/changes.html

-Kees

-- 
Kees Cook
Pixel Security
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 08/21/17 07:28, Peter Zijlstra wrote:
>
> Ah, I see, this is large mode and that needs to use MOVABS to load 64bit
> immediates. Still, small RIP relative should be able to live at any
> point as long as everything lives inside the same 2G relative range, so
> would still allow the goal of increasing the KASLR range.
>
> So I'm not seeing how we need large mode for that. That said, after
> reading up on all this, RIP relative will not be too pretty either;
> while CALL is naturally RIP relative, data still needs an explicit %rip
> offset. Still, it loads better than the large model.
>

The large model makes no sense whatsoever. I think what we're actually
looking for is the small-PIC model.

Ingo asked:

> I.e. is there no GCC code generation mode where code can be placed anywhere
> in the canonical address space, yet call and jump distance is within 31 bits
> so that the generated code is fast?

That's the small-PIC model. I think if all symbols are forced to hidden then
it won't even need a GOT/PLT.

We do need to consider how we want modules to fit into whatever model we
choose, though. They can be adjacent, or we could go with a more
traditional dynamic link model where the modules can be separate, and
chained together with the main kernel via the GOT.

	-hpa
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Fri, Sep 22, 2017 at 9:32 AM, Ingo Molnar wrote:
>
> * Thomas Garnier wrote:
>
>> On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar wrote:
>> >
>> > ( Sorry about the delay in answering this. I could blame the delay on the
>> > merge window, but in reality I've been procrastinating this due to the
>> > permanent, non-trivial impact PIE has on generated C code. )
>> >
>> > * Thomas Garnier wrote:
>> >
>> >> 1) PIE sometimes needs two instructions to represent a single
>> >> instruction on mcmodel=kernel.
>> >
>> > What again is the typical frequency of this occurring in an x86-64
>> > defconfig kernel, with the very latest GCC?
>>
>> I am not sure what is the best way to measure that.
>
> If this is the dominant factor then 'sizeof vmlinux' ought to be enough:
>
>> With ORC: PIE .text is 0.814224% larger than baseline
>
> I.e. the overhead is +0.81% in both size and (roughly) in number of
> instructions executed.
>
> BTW., I think things improved with ORC because with ORC we have RBP as an
> extra register and with PIE we lose RBX - so register pressure in code
> generation is lower.

That makes sense.

>
> Ok, I suspect we can try it, but my preconditions for merging it would be:
>
> 1) Linus doesn't NAK it (obviously)

Of course.

> 2) we first implement the additional entropy bits that Linus suggested.
>
> does this work for you?

Sure, I can look at how feasible that is. If it is, can I send
everything as part of the same patch set? The additional entropy would
be enabled for all KASLR but PIE will be off-by-default of course.

>
> Thanks,
>
> 	Ingo

-- 
Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Thomas Garnier wrote:

> On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar wrote:
> >
> > ( Sorry about the delay in answering this. I could blame the delay on the
> > merge window, but in reality I've been procrastinating this due to the
> > permanent, non-trivial impact PIE has on generated C code. )
> >
> > * Thomas Garnier wrote:
> >
> >> 1) PIE sometimes needs two instructions to represent a single
> >> instruction on mcmodel=kernel.
> >
> > What again is the typical frequency of this occurring in an x86-64
> > defconfig kernel, with the very latest GCC?
>
> I am not sure what is the best way to measure that.

If this is the dominant factor then 'sizeof vmlinux' ought to be enough:

> With ORC: PIE .text is 0.814224% larger than baseline

I.e. the overhead is +0.81% in both size and (roughly) in number of
instructions executed.

BTW., I think things improved with ORC because with ORC we have RBP as an extra
register and with PIE we lose RBX - so register pressure in code generation is
lower.

Ok, I suspect we can try it, but my preconditions for merging it would be:

 1) Linus doesn't NAK it (obviously)

 2) we first implement the additional entropy bits that Linus suggested.

does this work for you?

Thanks,

	Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Sep 21, 2017 at 9:24 PM, Markus Trippelsdorf wrote:
> On 2017.09.21 at 14:21 -0700, Thomas Garnier wrote:
>> On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel wrote:
>> >
>> > On 21 September 2017 at 08:59, Ingo Molnar wrote:
>> > >
>> > > ( Sorry about the delay in answering this. I could blame the delay on
>> > > the merge window, but in reality I've been procrastinating this due to
>> > > the permanent, non-trivial impact PIE has on generated C code. )
>> > >
>> > > * Thomas Garnier wrote:
>> > >
>> > >> 1) PIE sometimes needs two instructions to represent a single
>> > >> instruction on mcmodel=kernel.
>> > >
>> > > What again is the typical frequency of this occurring in an x86-64
>> > > defconfig kernel, with the very latest GCC?
>> > >
>> > > Also, to make sure: which unwinder did you use for your measurements,
>> > > frame-pointers or ORC? Please use ORC only for future numbers, as
>> > > frame-pointers is obsolete from a performance measurement POV.
>> > >
>> > >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
>> > >
>> > > Hopefully this can either be fixed in GCC or at least influenced via a
>> > > compiler switch in the future.
>> > >
>> >
>> > There are somewhat related concerns in the ARM world, so it would be
>> > good if we could work with the GCC developers to get a more high level
>> > and arch neutral command line option (-mkernel-pie? sounds yummy!)
>> > that stops the compiler from making inferences that only hold for
>> > shared libraries and/or other hosted executables (GOT indirections,
>> > avoiding text relocations etc). That way, we will also be able to drop
>> > the 'hidden' visibility override at some point, which we currently
>> > need to prevent the compiler from redirecting all global symbol
>> > references via entries in the GOT.
>>
>> My plan was to add a -mtls-reg= to switch the default segment
>> register for stack cookies but I can see great benefits in having a
>> more general kernel flag that would allow us to get rid of the GOT and
>> PLT when building position independent code for the kernel. It
>> could also include optimizations like folding switch tables etc...
>>
>> Should we start a separate discussion on that? Anyone that would be
>> more experienced than I to push that to gcc & clang upstream?
>
> Just open a gcc bug. See
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81708 as an example.

Makes sense, I will look into this. Thanks Andy for the stack cookie bug!

>
> -- 
> Markus

-- 
Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 2017.09.21 at 14:21 -0700, Thomas Garnier wrote:
> On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel wrote:
> >
> > On 21 September 2017 at 08:59, Ingo Molnar wrote:
> > >
> > > ( Sorry about the delay in answering this. I could blame the delay on the
> > > merge window, but in reality I've been procrastinating this due to the
> > > permanent, non-trivial impact PIE has on generated C code. )
> > >
> > > * Thomas Garnier wrote:
> > >
> > >> 1) PIE sometimes needs two instructions to represent a single
> > >> instruction on mcmodel=kernel.
> > >
> > > What again is the typical frequency of this occurring in an x86-64
> > > defconfig kernel, with the very latest GCC?
> > >
> > > Also, to make sure: which unwinder did you use for your measurements,
> > > frame-pointers or ORC? Please use ORC only for future numbers, as
> > > frame-pointers is obsolete from a performance measurement POV.
> > >
> > >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
> > >
> > > Hopefully this can either be fixed in GCC or at least influenced via a
> > > compiler switch in the future.
> > >
> >
> > There are somewhat related concerns in the ARM world, so it would be
> > good if we could work with the GCC developers to get a more high level
> > and arch neutral command line option (-mkernel-pie? sounds yummy!)
> > that stops the compiler from making inferences that only hold for
> > shared libraries and/or other hosted executables (GOT indirections,
> > avoiding text relocations etc). That way, we will also be able to drop
> > the 'hidden' visibility override at some point, which we currently
> > need to prevent the compiler from redirecting all global symbol
> > references via entries in the GOT.
>
> My plan was to add a -mtls-reg= to switch the default segment
> register for stack cookies but I can see great benefits in having a
> more general kernel flag that would allow us to get rid of the GOT and
> PLT when building position independent code for the kernel. It
> could also include optimizations like folding switch tables etc...
>
> Should we start a separate discussion on that? Anyone that would be
> more experienced than I to push that to gcc & clang upstream?

Just open a gcc bug. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81708 as an example.

-- 
Markus
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Sep 21, 2017 at 2:16 PM, Thomas Garnierwrote: > > On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar wrote: > > > > ( Sorry about the delay in answering this. I could blame the delay on the > > merge > > window, but in reality I've been procrastinating this is due to the > > permanent, > > non-trivial impact PIE has on generated C code. ) > > > > * Thomas Garnier wrote: > > > >> 1) PIE sometime needs two instructions to represent a single > >> instruction on mcmodel=kernel. > > > > What again is the typical frequency of this occurring in an x86-64 defconfig > > kernel, with the very latest GCC? > > I am not sure what is the best way to measure that. A very approximate approach would be to look at each instruction using the signed trick with a _32S relocation. All _32S relocations won't be translated to more instructions because some are just relocating part of an absolute mov which would be actually smaller if relative. Used this command to get a relative estimate: objdump -dr ./baseline/vmlinux | egrep -A 2 '\-0x[0-9a-f]{8}' | grep _32S | wc -l Got 6130 places, if you assume each add at least 7 bytes. It adds at least 42910 bytes on the .text section. The text section is 78599 bytes bigger from baseline to PIE. That's at least 54% of the size difference. Assuming we found all of them and we can't factor the impact on using an additional register. Similar approach with the switch table but a bit more complex: 1) Find all constructs as with an lea (%rip) followed by a jmp instruction inside a function (typical unfolded switch case). 2) Remove occurrences of less than 4 for the destination address Result: 480 switch cases in 49 functions. Each case take at least 9 bytes and the switch itself takes 16 bytes (assuming one per function). That's 5104 bytes for easy to identify switches (less than 7% of the increase). I am certainly missing a lot of differences. I checked if the percpu changes impacted the size and it doesn't (only 3 bytes added on PIE). 
I also tried different ways to compare the .text section like size of symbols or number of bytes on full disassembly but the results are really off from the whole .text size so I am not sure if it is the right way to go about it. > > > > > Also, to make sure: which unwinder did you use for your measurements, > > frame-pointers or ORC? Please use ORC only for future numbers, as > > frame-pointers is obsolete from a performance measurement POV. > > I used the default configuration which uses frame-pointer. I built all > the different binaries with ORC and I see an improvement in size: > > On latest revision (just built and ran performance tests this week): > > With framepointer: PIE .text is 0.837324% than baseline > > With ORC: PIE .text is 0.814224% than baseline > > Comparing baselines only, ORC is -2.849832% than frame-pointers. > > > > >> 2) GCC does not optimize switches in PIE in order to reduce relocations: > > > > Hopefully this can either be fixed in GCC or at least influenced via a > > compiler > > switch in the future. > > > >> The switches are the biggest increase on small functions but I don't > >> think they represent a large portion of the difference (number 1 is). > > > > Ok. > > > >> A side note, while testing gcc 7.2.0 on hackbench I have seen the PIE > >> kernel being faster by 1% across multiple runs (comparing 50 runs done > >> across 5 reboots twice). I don't think PIE is faster than a > >> mcmodel=kernel but recent versions of gcc makes them fairly similar. > > > > So I think we are down to an overhead range where the inherent noise (both > > random > > and systematic one) in 'hackbench' overwhelms the signal we are trying to > > measure. > > > > So I think it's the kernel .text size change that is the best noise-free > > proxy for > > the overhead impact of PIE. > > I agree but it might be hard to measure the exact impact. What is > acceptable and what is not? 
> > > > > It doesn't hurt to double check actual real performance as well, just don't > > expect > > there to be much of a signal for anything but fully cached microbenchmark > > workloads. > > That's aligned with what I see in the latest performance testing. > Performance is close enough that it is hard to get exact numbers (PIE > is just a bit slower than baseline on hackbench (~1%)). > > > > Thanks, > > > > Ingo > > > > -- > Thomas -- Thomas ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel wrote: > > On 21 September 2017 at 08:59, Ingo Molnar wrote: > > > > ( Sorry about the delay in answering this. I could blame the delay on the > > merge > > window, but in reality I've been procrastinating this due to the > > permanent, > > non-trivial impact PIE has on generated C code. ) > > > > * Thomas Garnier wrote: > > > >> 1) PIE sometimes needs two instructions to represent a single > >> instruction on mcmodel=kernel. > > > > What again is the typical frequency of this occurring in an x86-64 defconfig > > kernel, with the very latest GCC? > > > > Also, to make sure: which unwinder did you use for your measurements, > > frame-pointers or ORC? Please use ORC only for future numbers, as > > frame-pointers is obsolete from a performance measurement POV. > > > >> 2) GCC does not optimize switches in PIE in order to reduce relocations: > > > > Hopefully this can either be fixed in GCC or at least influenced via a > > compiler > > switch in the future. > > > > There are somewhat related concerns in the ARM world, so it would be > good if we could work with the GCC developers to get a more high level > and arch neutral command line option (-mkernel-pie? sounds yummy!) > that stops the compiler from making inferences that only hold for > shared libraries and/or other hosted executables (GOT indirections, > avoiding text relocations etc). That way, we will also be able to drop > the 'hidden' visibility override at some point, which we currently > need to prevent the compiler from redirecting all global symbol > references via entries in the GOT. My plan was to add a -mtls-reg= option to switch the default segment register for stack cookies, but I can see great benefits in having a more general kernel flag that would allow us to get rid of the GOT and PLT when building position independent code for the kernel. It could also include optimizations like folding switch tables etc...
Should we start a separate discussion on that? Is there anyone more experienced than I am who could push that to the gcc & clang upstreams? > > All we really need is the ability to move the image around in virtual > memory, and things like reducing the CoW footprint or enabling ELF > symbol preemption are completely irrelevant for us. -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar wrote: > > ( Sorry about the delay in answering this. I could blame the delay on the > merge > window, but in reality I've been procrastinating this due to the > permanent, > non-trivial impact PIE has on generated C code. ) > > * Thomas Garnier wrote: > >> 1) PIE sometimes needs two instructions to represent a single >> instruction on mcmodel=kernel. > > What again is the typical frequency of this occurring in an x86-64 defconfig > kernel, with the very latest GCC? I am not sure what is the best way to measure that. > > Also, to make sure: which unwinder did you use for your measurements, > frame-pointers or ORC? Please use ORC only for future numbers, as > frame-pointers is obsolete from a performance measurement POV. I used the default configuration which uses frame-pointers. I built all the different binaries with ORC and I see an improvement in size: On the latest revision (just built and ran performance tests this week): With frame-pointers: PIE .text is 0.837324% larger than baseline. With ORC: PIE .text is 0.814224% larger than baseline. Comparing baselines only, ORC is 2.849832% smaller than frame-pointers. > >> 2) GCC does not optimize switches in PIE in order to reduce relocations: > > Hopefully this can either be fixed in GCC or at least influenced via a > compiler > switch in the future. > >> The switches are the biggest increase on small functions but I don't >> think they represent a large portion of the difference (number 1 is). > > Ok. > >> A side note, while testing gcc 7.2.0 on hackbench I have seen the PIE >> kernel being faster by 1% across multiple runs (comparing 50 runs done >> across 5 reboots twice). I don't think PIE is faster than >> mcmodel=kernel but recent versions of gcc make them fairly similar. > > So I think we are down to an overhead range where the inherent noise (both > random > and systematic one) in 'hackbench' overwhelms the signal we are trying to > measure.
> > So I think it's the kernel .text size change that is the best noise-free > proxy for > the overhead impact of PIE. I agree, but it might be hard to measure the exact impact. What is acceptable and what is not? > > It doesn't hurt to double check actual real performance as well, just don't > expect > there to be much of a signal for anything but fully cached microbenchmark > workloads. That's aligned with what I see in the latest performance testing. Performance is close enough that it is hard to get exact numbers (PIE is just a bit slower than baseline on hackbench (~1%)). > > Thanks, > > Ingo -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 21 September 2017 at 08:59, Ingo Molnar wrote: > > ( Sorry about the delay in answering this. I could blame the delay on the > merge > window, but in reality I've been procrastinating this due to the > permanent, > non-trivial impact PIE has on generated C code. ) > > * Thomas Garnier wrote: > >> 1) PIE sometimes needs two instructions to represent a single >> instruction on mcmodel=kernel. > > What again is the typical frequency of this occurring in an x86-64 defconfig > kernel, with the very latest GCC? > > Also, to make sure: which unwinder did you use for your measurements, > frame-pointers or ORC? Please use ORC only for future numbers, as > frame-pointers is obsolete from a performance measurement POV. > >> 2) GCC does not optimize switches in PIE in order to reduce relocations: > > Hopefully this can either be fixed in GCC or at least influenced via a > compiler > switch in the future. > There are somewhat related concerns in the ARM world, so it would be good if we could work with the GCC developers to get a more high level and arch neutral command line option (-mkernel-pie? sounds yummy!) that stops the compiler from making inferences that only hold for shared libraries and/or other hosted executables (GOT indirections, avoiding text relocations etc). That way, we will also be able to drop the 'hidden' visibility override at some point, which we currently need to prevent the compiler from redirecting all global symbol references via entries in the GOT. All we really need is the ability to move the image around in virtual memory, and things like reducing the CoW footprint or enabling ELF symbol preemption are completely irrelevant for us.
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
( Sorry about the delay in answering this. I could blame the delay on the merge window, but in reality I've been procrastinating this due to the permanent, non-trivial impact PIE has on generated C code. ) * Thomas Garnier wrote: > 1) PIE sometimes needs two instructions to represent a single > instruction on mcmodel=kernel. What again is the typical frequency of this occurring in an x86-64 defconfig kernel, with the very latest GCC? Also, to make sure: which unwinder did you use for your measurements, frame-pointers or ORC? Please use ORC only for future numbers, as frame-pointers is obsolete from a performance measurement POV. > 2) GCC does not optimize switches in PIE in order to reduce relocations: Hopefully this can either be fixed in GCC or at least influenced via a compiler switch in the future. > The switches are the biggest increase on small functions but I don't > think they represent a large portion of the difference (number 1 is). Ok. > A side note, while testing gcc 7.2.0 on hackbench I have seen the PIE > kernel being faster by 1% across multiple runs (comparing 50 runs done > across 5 reboots twice). I don't think PIE is faster than > mcmodel=kernel but recent versions of gcc make them fairly similar. So I think we are down to an overhead range where the inherent noise (both random and systematic one) in 'hackbench' overwhelms the signal we are trying to measure. So I think it's the kernel .text size change that is the best noise-free proxy for the overhead impact of PIE. It doesn't hurt to double check actual real performance as well, just don't expect there to be much of a signal for anything but fully cached microbenchmark workloads. Thanks, Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Fri, Aug 25, 2017 at 8:05 AM, Thomas Garnier wrote: > On Fri, Aug 25, 2017 at 1:04 AM, Ingo Molnar wrote: >> >> * Thomas Garnier wrote: >> >>> With the fix for function tracing, the hackbench results have an >>> average of +0.8 to +1.4% (from +8% to +10% before). With a default >>> configuration, the numbers are closer to 0.8%. >>> >>> On the .text size, with gcc 4.9 I see +0.8% on default configuration >>> and +1.180% on the ubuntu configuration. >> >> A 1% text size increase is still significant. Could you look at the >> disassembly, >> where does the size increase come from? > > I will take a look, in this current iteration I added the .got and > .got.plt so removing them will remove a bit (even if they are small, > we don't use them to increase perf). > > What do you think about the perf numbers in general so far? I looked at the size increase. I could identify two common cases: 1) PIE sometimes needs two instructions to represent a single instruction on mcmodel=kernel. For example, this instruction plays on sign extension (mcmodel=kernel):
mov r9,QWORD PTR [r11*8-0x7e3da060] (8 bytes)
The address 0xffffffff81c25fa0 can be represented as -0x7e3da060 using a _32S relocation.
With PIE:
lea rbx,[rip+] (7 bytes)
mov r9,QWORD PTR [rbx+r11*8] (6 bytes)
2) GCC does not optimize switches in PIE in order to reduce relocations. For example the switch in phy_modes [1]:
static inline const char *phy_modes(phy_interface_t interface)
{
	switch (interface) {
	case PHY_INTERFACE_MODE_NA:
		return "";
	case PHY_INTERFACE_MODE_INTERNAL:
		return "internal";
	case PHY_INTERFACE_MODE_MII:
		return "mii";
Without PIE (gcc 7.2.0), the whole table is optimized into one instruction:
0x0040045b <+27>: mov rdi,QWORD PTR [rax*8+0x400660]
With PIE (gcc 7.2.0):
0x0641 <+33>: movsxd rax,DWORD PTR [rdx+rax*4]
0x0645 <+37>: add    rax,rdx
0x0648 <+40>: jmp    rax
0x065d <+61>: lea    rdi,[rip+0x264] # 0x8c8
0x0664 <+68>: jmp    0x651
0x0666 <+70>: lea    rdi,[rip+0x2bc] # 0x929
0x066d <+77>: jmp    0x651
0x066f <+79>: lea    rdi,[rip+0x2a8] # 0x91e
0x0676 <+86>: jmp    0x651
0x0678 <+88>: lea    rdi,[rip+0x294] # 0x913
0x067f <+95>: jmp    0x651
That's a deliberate choice; clang is able to optimize it (clang-3.8):
0x0963 <+19>: lea rcx,[rip+0x200406] # 0x200d70
0x096a <+26>: mov rdi,QWORD PTR [rcx+rax*8]
I checked gcc, and the code deciding whether to fold the switch simply does not do it for PIC, in order to reduce relocations [2]. The switches are the biggest increase on small functions but I don't think they represent a large portion of the difference (number 1 is). A side note: while testing gcc 7.2.0 on hackbench I have seen the PIE kernel being faster by 1% across multiple runs (comparing 50 runs done across 5 reboots twice). I don't think PIE is faster than mcmodel=kernel, but recent versions of gcc make them fairly similar. [1] http://elixir.free-electrons.com/linux/v4.13-rc7/source/include/linux/phy.h#L113 [2] https://github.com/gcc-mirror/gcc/blob/7977b0509f07e42fbe0f06efcdead2b7e4a5135f/gcc/tree-switch-conversion.c#L828 > >> >> Thanks, >> >> Ingo > > > > -- > Thomas -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 08/21/17 07:31, Peter Zijlstra wrote: > On Tue, Aug 15, 2017 at 07:20:38AM -0700, Thomas Garnier wrote: >> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: > >>> Have you considered a kernel with -mcmodel=small (or medium) instead of >>> -fpie >>> -mcmodel=large? We can pick a random 2GB window in the (non-kernel) >>> canonical >>> x86-64 address space to randomize the location of kernel text. The location >>> of >>> modules can be further randomized within that 2GB window. >> >> -mcmodel=small/medium assume you are in the low 32 bits. It generates >> instructions where the virtual addresses have the high 32 bits set to >> zero. > > That's a compiler fail, right? Because the SDM states that for "CALL > rel32" the 32bit displacement is sign extended on x86_64. > No. It is about whether you can do something like:
movl $variable, %eax /* rax = */
or
addl %ecx,variable(,%rsi,4) /* variable[rsi] += ecx */
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Aug 24, 2017 at 2:42 PM, Linus Torvalds wrote: > > On Thu, Aug 24, 2017 at 2:13 PM, Thomas Garnier wrote: > > > > My original performance testing was done with an Ubuntu generic > > configuration. This configuration has the CONFIG_FUNCTION_TRACER > > option which was incompatible with PIE. The tracer failed to replace > > the __fentry__ call by a nop slide on each traceable function because > > the instruction was not the one expected. If PIE is enabled, gcc > > generates a different call instruction based on the GOT without > > checking the visibility options (basically call *__fentry__@GOTPCREL). > > Gah. > > Don't we actually have *more* address bits for randomization at the > low end, rather than getting rid of -mcmodel=kernel? We have, but I think we use most of it for potential modules and the fixmap, and it is not that big. The increase in range from 1G to 3G is just an example and a way to ensure PIE works as expected. The long term goal is being able to put the kernel wherever we want in memory, randomizing the position and the order of almost all memory sections. That would be valuable against BTB attacks [1], for example, where randomization of the low 32 bits is ineffective. [1] https://github.com/felixwilhelm/mario_baslr > > Has anybody looked at just moving kernel text by smaller values than > the page size? Yeah, yeah, the kernel has several sections that need > page alignment, but I think we could relocate normal text by just the > cacheline size, and that sounds like it would give several bits of > randomness with little downside. I didn't look into it. There is value in it depending on the performance impact. I think both PIE and lower-grain randomization would be useful. > > Or has somebody already looked at it and I just missed it? > > Linus -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Fri, Aug 25, 2017 at 1:04 AM, Ingo Molnar wrote: > > * Thomas Garnier wrote: > >> With the fix for function tracing, the hackbench results have an >> average of +0.8 to +1.4% (from +8% to +10% before). With a default >> configuration, the numbers are closer to 0.8%. >> >> On the .text size, with gcc 4.9 I see +0.8% on default configuration >> and +1.180% on the ubuntu configuration. > > A 1% text size increase is still significant. Could you look at the > disassembly, > where does the size increase come from? I will take a look; in this current iteration I added the .got and .got.plt, so removing them will remove a bit (even if they are small, we don't use them to increase perf). What do you think about the perf numbers in general so far? > > Thanks, > > Ingo -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Thomas Garnier wrote: > With the fix for function tracing, the hackbench results have an > average of +0.8 to +1.4% (from +8% to +10% before). With a default > configuration, the numbers are closer to 0.8%. > > On the .text size, with gcc 4.9 I see +0.8% on default configuration > and +1.180% on the ubuntu configuration. A 1% text size increase is still significant. Could you look at the disassembly, where does the size increase come from? Thanks, Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, 24 Aug 2017 14:13:38 -0700 Thomas Garnier wrote: > With the fix for function tracing, the hackbench results have an > average of +0.8 to +1.4% (from +8% to +10% before). With a default > configuration, the numbers are closer to 0.8%. Wow, an empty fentry function not "nop"ed out only added 8% to 10% overhead. I never did the benchmarks of that since I did it before fentry was introduced, which was with the old "mcount". That gave an average of 13% overhead in hackbench. -- Steve
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Aug 24, 2017 at 2:13 PM, Thomas Garnier wrote: > > My original performance testing was done with an Ubuntu generic > configuration. This configuration has the CONFIG_FUNCTION_TRACER > option which was incompatible with PIE. The tracer failed to replace > the __fentry__ call by a nop slide on each traceable function because > the instruction was not the one expected. If PIE is enabled, gcc > generates a different call instruction based on the GOT without > checking the visibility options (basically call *__fentry__@GOTPCREL). Gah. Don't we actually have *more* address bits for randomization at the low end, rather than getting rid of -mcmodel=kernel? Has anybody looked at just moving kernel text by smaller values than the page size? Yeah, yeah, the kernel has several sections that need page alignment, but I think we could relocate normal text by just the cacheline size, and that sounds like it would give several bits of randomness with little downside. Or has somebody already looked at it and I just missed it? Linus
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Aug 17, 2017 at 7:10 AM, Thomas Garnier wrote: > > On Thu, Aug 17, 2017 at 1:09 AM, Ingo Molnar wrote: > > > > > > * Thomas Garnier wrote: > > > > > > > -mcmodel=small/medium assume you are in the low 32 bits. It generates > > > > > instructions where the virtual addresses have the high 32 bits set to > > > > > zero. > > > > > > > > How are these assumptions hardcoded by GCC? Most of the instructions > > > > should be > > > > relocatable straight away, as most call/jump/branch instructions are > > > > RIP-relative. > > > > > > I think PIE is capable of using relative instructions well. mcmodel=large > > > assumes > > > symbols can be anywhere. > > > > So if the numbers in your changelog and Kconfig text cannot be trusted, > > there's > > this description of the size impact which I suspect is less susceptible to > > measurement error: > > > > + The kernel and modules will generate slightly more assembly (1 to > > 2% > > + increase on the .text sections). The vmlinux binary will be > > + significantly smaller due to less relocations. > > > > ... but describing a 1-2% kernel text size increase as "slightly more > > assembly" > > shows a gratuitous disregard to kernel code generation quality! In reality > > that's > > a huge size increase that in most cases will almost directly transfer to a > > 1-2% > > slowdown for kernel intense workloads. > > > > > > Where does that size increase come from, if PIE is capable of using relative > > instructions well? Does it come from the loss of a generic register and the > > resulting increase in register pressure, stack spills, etc.? > > I will try to gather more information on the size increase. The size > increase might be smaller with gcc 4.9 given performance was much > better. Coming back on this thread as I identified the root cause of the performance issue. My original performance testing was done with an Ubuntu generic configuration. This configuration has the CONFIG_FUNCTION_TRACER option which was incompatible with PIE.
The tracer failed to replace the __fentry__ call by a nop slide on each traceable function because the instruction was not the one expected. If PIE is enabled, gcc generates a different call instruction based on the GOT without checking the visibility options (basically call *__fentry__@GOTPCREL). With the fix for function tracing, the hackbench results have an average of +0.8 to +1.4% (from +8% to +10% before). With a default configuration, the numbers are closer to 0.8%. On the .text size, with gcc 4.9 I see +0.8% on default configuration and +1.180% on the ubuntu configuration. The next iteration should have an updated set of performance metrics (I will try to use gcc 6.0 or higher) and incorporate the fix for function tracing. Let me know if you have questions and feedback. > > > > > So I'm still unhappy about this all, and about the attitude surrounding it. > > > > Thanks, > > > > Ingo > > > > > -- > Thomas -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Mon, Aug 21, 2017 at 7:31 AM, Peter Zijlstra wrote: > On Tue, Aug 15, 2017 at 07:20:38AM -0700, Thomas Garnier wrote: >> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: > >> > Have you considered a kernel with -mcmodel=small (or medium) instead of >> > -fpie >> > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) >> > canonical >> > x86-64 address space to randomize the location of kernel text. The >> > location of >> > modules can be further randomized within that 2GB window. >> >> -mcmodel=small/medium assume you are in the low 32 bits. It generates >> instructions where the virtual addresses have the high 32 bits set to >> zero. > > That's a compiler fail, right? Because the SDM states that for "CALL > rel32" the 32bit displacement is sign extended on x86_64. > That's different from what I expected at first, too. Now, I think I have an alternative to using mcmodel=large. I could use -fPIC and ensure modules are never far away from the main kernel (moving the module section start close to the randomized kernel end). I looked at it and that seems possible but will require more work. I plan to start with the mcmodel=large support and add this mode in a way that could benefit classic KASLR (without -fPIC), because it randomizes where modules start based on the kernel. -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Tue, Aug 15, 2017 at 07:20:38AM -0700, Thomas Garnier wrote: > On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: > > Have you considered a kernel with -mcmodel=small (or medium) instead of > > -fpie > > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) > > canonical > > x86-64 address space to randomize the location of kernel text. The location > > of > > modules can be further randomized within that 2GB window. > > -mcmodel=small/medium assume you are in the low 32 bits. It generates > instructions where the virtual addresses have the high 32 bits set to > zero. That's a compiler fail, right? Because the SDM states that for "CALL rel32" the 32bit displacement is sign extended on x86_64.
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Mon, Aug 21, 2017 at 03:32:22PM +0200, Peter Zijlstra wrote: > On Wed, Aug 16, 2017 at 05:12:35PM +0200, Ingo Molnar wrote: > > Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine > > instruction level. > > > > Function calls look like this: > > > > -mcmodel=medium:
> >    757: e8 98 ff ff ff          callq 6f4
> > -mcmodel=large
> >    77b: 48 b8 10 f7 df ff ff    movabs $0xffffffffffdff710,%rax
> >    782: ff ff ff
> >    785: 48 8d 04 03             lea (%rbx,%rax,1),%rax
> >    789: ff d0                   callq *%rax
> > > > And we'd do this for _EVERY_ function call in the kernel. That kind of crap > > is > > totally unacceptable. > > So why does this need to be computed for every single call? How often > will we move the kernel around at runtime? > > Why can't we process the relocation at load time and then discard the > relocation tables along with the rest of __init ? Ah, I see, this is large mode and that needs to use MOVABS to load 64bit immediates. Still, small RIP-relative code should be able to live at any point as long as everything lives inside the same 2G relative range, so it would still allow the goal of increasing the KASLR range. So I'm not seeing how we need large mode for that. That said, after reading up on all this, RIP-relative will not be too pretty either: while CALL is naturally RIP-relative, data still needs an explicit %rip offset. Still, it reads a lot better than the large model.
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Wed, Aug 16, 2017 at 05:12:35PM +0200, Ingo Molnar wrote: > Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine > instruction level. > > Function calls look like this: > > -mcmodel=medium:
>    757: e8 98 ff ff ff          callq 6f4
> > -mcmodel=large
>    77b: 48 b8 10 f7 df ff ff    movabs $0xffffffffffdff710,%rax
>    782: ff ff ff
>    785: 48 8d 04 03             lea (%rbx,%rax,1),%rax
>    789: ff d0                   callq *%rax
> > And we'd do this for _EVERY_ function call in the kernel. That kind of crap is > totally unacceptable. So why does this need to be computed for every single call? How often will we move the kernel around at runtime? Why can't we process the relocation at load time and then discard the relocation tables along with the rest of __init ?
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Aug 17, 2017 at 1:09 AM, Ingo Molnar wrote: > > > * Thomas Garnier wrote: > > > > > -mcmodel=small/medium assume you are in the low 32 bits. It generates > > > > instructions where the virtual addresses have the high 32 bits set to > > > > zero. > > > > > > How are these assumptions hardcoded by GCC? Most of the instructions > > > should be > > > relocatable straight away, as most call/jump/branch instructions are > > > RIP-relative. > > > > I think PIE is capable of using relative instructions well. mcmodel=large > > assumes > > symbols can be anywhere. > > So if the numbers in your changelog and Kconfig text cannot be trusted, > there's > this description of the size impact which I suspect is less susceptible to > measurement error: > > + The kernel and modules will generate slightly more assembly (1 to 2% > + increase on the .text sections). The vmlinux binary will be > + significantly smaller due to less relocations. > > ... but describing a 1-2% kernel text size increase as "slightly more > assembly" > shows a gratuitous disregard to kernel code generation quality! In reality > that's > a huge size increase that in most cases will almost directly transfer to a > 1-2% > slowdown for kernel intense workloads. > > > Where does that size increase come from, if PIE is capable of using relative > instructions well? Does it come from the loss of a generic register and the > resulting increase in register pressure, stack spills, etc.? I will try to gather more information on the size increase. The size increase might be smaller with gcc 4.9 given performance was much better. > > So I'm still unhappy about this all, and about the attitude surrounding it. > > Thanks, > > Ingo -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Thomas Garnier wrote: > > > -mcmodel=small/medium assume you are in the low 32 bits. It generates > > > instructions where the virtual addresses have the high 32 bits set to zero. > > > > How are these assumptions hardcoded by GCC? Most of the instructions should > > be > > relocatable straight away, as most call/jump/branch instructions are > > RIP-relative. > > I think PIE is capable of using relative instructions well. mcmodel=large > assumes > symbols can be anywhere. So if the numbers in your changelog and Kconfig text cannot be trusted, there's this description of the size impact which I suspect is less susceptible to measurement error: + The kernel and modules will generate slightly more assembly (1 to 2% + increase on the .text sections). The vmlinux binary will be + significantly smaller due to less relocations. ... but describing a 1-2% kernel text size increase as "slightly more assembly" shows a gratuitous disregard to kernel code generation quality! In reality that's a huge size increase that in most cases will almost directly transfer to a 1-2% slowdown for kernel intense workloads. Where does that size increase come from, if PIE is capable of using relative instructions well? Does it come from the loss of a generic register and the resulting increase in register pressure, stack spills, etc.? So I'm still unhappy about this all, and about the attitude surrounding it. Thanks, Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Wed, Aug 16, 2017 at 8:12 AM, Ingo Molnar wrote: > > > * Thomas Garnier wrote: > > > On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: > > > > > > * Thomas Garnier wrote: > > > > > >> > Do these changes get us closer to being able to build the kernel as > > >> > truly > > >> > position independent, i.e. to place it anywhere in the valid x86-64 > > >> > address > > >> > space? Or any other advantages? > > >> > > >> Yes, PIE allows us to put the kernel anywhere in memory. It will allow > > >> us to > > >> have a full randomized address space where position and order of > > >> sections are > > >> completely random. There is still some work to get there but being able > > >> to build > > >> a PIE kernel is a significant step. > > > > > > So I _really_ dislike the whole PIE approach, because of the huge > > > slowdown: > > > > > > +config RANDOMIZE_BASE_LARGE > > > + bool "Increase the randomization range of the kernel image" > > > + depends on X86_64 && RANDOMIZE_BASE > > > + select X86_PIE > > > + select X86_MODULE_PLTS if MODULES > > > + default n > > > + ---help--- > > > + Build the kernel as a Position Independent Executable (PIE) and > > > + increase the available randomization range from 1GB to 3GB. > > > + > > > + This option impacts performance on kernel CPU intensive > > > workloads up > > > + to 10% due to PIE generated code. Impact on user-mode processes > > > and > > > + typical usage would be significantly less (0.50% when you build > > > the > > > + kernel). > > > + > > > + The kernel and modules will generate slightly more assembly (1 > > > to 2% > > > + increase on the .text sections). The vmlinux binary will be > > > + significantly smaller due to less relocations. > > > > > > To put 10% kernel overhead into perspective: enabling this option wipes > > > out about > > > 5-10 years worth of painstaking optimizations we've done to keep the > > > kernel fast > > > ... (!!) > > > > Note that 10% is the high-bound of a CPU intensive workload.
> > Note that the 8-10% hackbench or even a 2%-4% range would be 'huge' in terms > of > modern kernel performance. In many cases we are literally applying cycle level > optimizations that are barely measurable. A 0.1% speedup in linear execution > speed > is already a big success. > > > I am going to start doing performance testing on -mcmodel=large to see if > > it is > > faster than -fPIE. > > Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine > instruction level. > > Function calls look like this: > > -mcmodel=medium: > >757: e8 98 ff ff ff callq 6f4 > > -mcmodel=large > >77b: 48 b8 10 f7 df ff ff movabs $0xffdff710,%rax >782: ff ff ff >785: 48 8d 04 03 lea (%rbx,%rax,1),%rax >789: ff d0 callq *%rax > > And we'd do this for _EVERY_ function call in the kernel. That kind of crap is > totally unacceptable. > I started looking into mcmodel=large and ran into multiple issues. In the meantime, I thought I would try different configurations and compilers. I did 10 hackbench runs across 10 reboots with and without pie (same commit) with gcc 4.9. I copied the result below and based on the hackbench configuration we are between -0.29% and 1.92% (average across runs is 0.8%) which seems more aligned with what people discussed in this thread. I don't know how I got 10% maximum on hackbench, I am still investigating. It could be the configuration I used or my base compiler being too old. > > > I think the fundamental flaw is the assumption that we need a PIE > > > executable > > > to have a freely relocatable kernel on 64-bit CPUs. > > > > > > Have you considered a kernel with -mcmodel=small (or medium) instead of > > > -fpie > > > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) > > > canonical > > > x86-64 address space to randomize the location of kernel text. The > > > location of > > > modules can be further randomized within that 2GB window. > > > > -mcmodel=small/medium assume you are in the low 32 bits. 
It generates > > instructions > > where the virtual addresses have the high 32-bit to be zero. > > How are these assumptions hardcoded by GCC? Most of the instructions should be > relocatable straight away, as most call/jump/branch instructions are > RIP-relative. I think PIE is capable of using relative instructions well. mcmodel=large assumes symbols can be anywhere. > > I.e. is there no GCC code generation mode where code can be placed anywhere > in the > canonical address space, yet call and jump distance is within 31 bits so that > the > generated code is fast? I think that's basically PIE. With PIE, you have the assumption that everything is close; the main issue is any assembly referencing absolute addresses.
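The "within 31 bits" figure discussed above comes from the x86-64 `call rel32`/`jmp rel32` encoding, which stores a signed 32-bit displacement relative to the next instruction. A toy helper (hypothetical, for illustration only — not kernel code) makes the reachability rule concrete:

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 direct calls encode a signed 32-bit displacement measured
 * from the end of the 5-byte call instruction, so a direct call can
 * only reach targets within roughly +/- 2 GB of the call site.  This
 * is why a kernel confined to any single 2 GB window can keep cheap
 * direct calls no matter where that window sits in the canonical
 * address space. */
static int fits_rel32(uint64_t from, uint64_t to)
{
    /* unsigned subtraction wraps; the cast recovers the signed delta */
    int64_t disp = (int64_t)(to - (from + 5));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

Anything outside that reach needs the `movabs` + indirect-call sequence Ingo quotes for `-mcmodel=large`.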
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 16 August 2017 at 17:26, Daniel Micay wrote: >> How are these assumptions hardcoded by GCC? Most of the instructions >> should be >> relocatable straight away, as most call/jump/branch instructions are >> RIP-relative. >> >> I.e. is there no GCC code generation mode where code can be placed >> anywhere in the >> canonical address space, yet call and jump distance is within 31 bits >> so that the >> generated code is fast? > > That's what PIE is meant to do. However, not disabling support for lazy > linking (-fno-plt) / symbol interposition (-Bsymbolic) is going to cause > it to add needless overhead. > > arm64 is using -pie -shared -Bsymbolic in arch/arm64/Makefile for their > CONFIG_RELOCATABLE option. See 08cc55b2afd97a654f71b3bebf8bb0ec89fdc498. The difference with arm64 is that its generic small code model is already position independent, so we don't have to pass -fpic or -fpie to the compiler. We only link in PIE mode to get the linker to emit the dynamic relocation tables into the ELF binary. Relative branches have a range of +/- 128 MB, which covers the kernel and modules (unless the option to randomize the module region independently has been selected, in which case branches between the kernel and modules may be resolved via PLT entries that are emitted at module load time). I am not sure how this extrapolates to x86, just adding some context. ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
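The +/- 128 MB figure above follows from the arm64 `B`/`BL` encoding: a signed 26-bit immediate scaled by 4 bytes. A small illustrative helper (not real loader code) shows exactly which offsets a direct branch can encode:

```c
#include <assert.h>
#include <stdint.h>

/* arm64 B/BL store a signed 26-bit word offset, i.e. imm26 * 4, giving
 * a reach of [-2^27, 2^27 - 4] bytes from the branch.  When kernel and
 * modules fit inside one such region every branch links directly;
 * otherwise the module loader must emit PLT entries at load time, as
 * the message above describes. */
static int bl_reaches(int64_t offset)
{
    return (offset & 3) == 0 &&         /* instruction-word aligned */
           offset >= -(1LL << 27) &&    /* -128 MB */
           offset <   (1LL << 27);      /* +128 MB (exclusive) */
}
```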
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
> How are these assumptions hardcoded by GCC? Most of the instructions > should be > relocatable straight away, as most call/jump/branch instructions are > RIP-relative. > > I.e. is there no GCC code generation mode where code can be placed > anywhere in the > canonical address space, yet call and jump distance is within 31 bits > so that the > generated code is fast? That's what PIE is meant to do. However, not disabling support for lazy linking (-fno-plt) / symbol interposition (-Bsymbolic) is going to cause it to add needless overhead. arm64 is using -pie -shared -Bsymbolic in arch/arm64/Makefile for their CONFIG_RELOCATABLE option. See 08cc55b2afd97a654f71b3bebf8bb0ec89fdc498.
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Wed, 16 Aug 2017, Ingo Molnar wrote: > And we'd do this for _EVERY_ function call in the kernel. That kind of crap is > totally unacceptable. Ahh, finally a limit is in sight as to how much security hardening etc. can reduce kernel performance.
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Thomas Garnier wrote: > On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: > > > > * Thomas Garnier wrote: > > > >> > Do these changes get us closer to being able to build the kernel as truly > >> > position independent, i.e. to place it anywhere in the valid x86-64 > >> > address > >> > space? Or any other advantages? > >> > >> Yes, PIE allows us to put the kernel anywhere in memory. It will allow us > >> to > >> have a full randomized address space where position and order of sections > >> are > >> completely random. There is still some work to get there but being able to > >> build > >> a PIE kernel is a significant step. > > > > So I _really_ dislike the whole PIE approach, because of the huge slowdown: > > > > +config RANDOMIZE_BASE_LARGE > > + bool "Increase the randomization range of the kernel image" > > + depends on X86_64 && RANDOMIZE_BASE > > + select X86_PIE > > + select X86_MODULE_PLTS if MODULES > > + default n > > + ---help--- > > + Build the kernel as a Position Independent Executable (PIE) and > > + increase the available randomization range from 1GB to 3GB. > > + > > + This option impacts performance on kernel CPU intensive workloads > > up > > + to 10% due to PIE generated code. Impact on user-mode processes > > and > > + typical usage would be significantly less (0.50% when you build > > the > > + kernel). > > + > > + The kernel and modules will generate slightly more assembly (1 to > > 2% > > + increase on the .text sections). The vmlinux binary will be > > + significantly smaller due to less relocations. > > > > To put 10% kernel overhead into perspective: enabling this option wipes out > > about > > 5-10 years worth of painstaking optimizations we've done to keep the kernel > > fast > > ... (!!) > > Note that 10% is the high-bound of a CPU intensive workload. Note that the 8-10% hackbench or even a 2%-4% range would be 'huge' in terms of modern kernel performance. 
In many cases we are literally applying cycle level optimizations that are barely measurable. A 0.1% speedup in linear execution speed is already a big success. > I am going to start doing performance testing on -mcmodel=large to see if it > is > faster than -fPIE. Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine instruction level. Function calls look like this: -mcmodel=medium: 757: e8 98 ff ff ff callq 6f4 -mcmodel=large 77b: 48 b8 10 f7 df ff ff movabs $0xffdff710,%rax 782: ff ff ff 785: 48 8d 04 03 lea (%rbx,%rax,1),%rax 789: ff d0 callq *%rax And we'd do this for _EVERY_ function call in the kernel. That kind of crap is totally unacceptable. > > I think the fundamental flaw is the assumption that we need a PIE > > executable > > to have a freely relocatable kernel on 64-bit CPUs. > > > > Have you considered a kernel with -mcmodel=small (or medium) instead of > > -fpie > > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) > > canonical > > x86-64 address space to randomize the location of kernel text. The location > > of > > modules can be further randomized within that 2GB window. > > -mcmodel=small/medium assume you are in the low 32 bits. It generates > instructions > where the virtual addresses have the high 32-bit to be zero. How are these assumptions hardcoded by GCC? Most of the instructions should be relocatable straight away, as most call/jump/branch instructions are RIP-relative. I.e. is there no GCC code generation mode where code can be placed anywhere in the canonical address space, yet call and jump distance is within 31 bits so that the generated code is fast? Thanks, Ingo
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Tue, Aug 15, 2017 at 7:47 AM, Daniel Micay wrote: > On 15 August 2017 at 10:20, Thomas Garnier wrote: >> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: >>> >>> * Thomas Garnier wrote: >>> > Do these changes get us closer to being able to build the kernel as truly > position independent, i.e. to place it anywhere in the valid x86-64 > address > space? Or any other advantages? Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to have a full randomized address space where position and order of sections are completely random. There is still some work to get there but being able to build a PIE kernel is a significant step. >>> >>> So I _really_ dislike the whole PIE approach, because of the huge slowdown: >>> >>> +config RANDOMIZE_BASE_LARGE >>> + bool "Increase the randomization range of the kernel image" >>> + depends on X86_64 && RANDOMIZE_BASE >>> + select X86_PIE >>> + select X86_MODULE_PLTS if MODULES >>> + default n >>> + ---help--- >>> + Build the kernel as a Position Independent Executable (PIE) and >>> + increase the available randomization range from 1GB to 3GB. >>> + >>> + This option impacts performance on kernel CPU intensive workloads >>> up >>> + to 10% due to PIE generated code. Impact on user-mode processes >>> and >>> + typical usage would be significantly less (0.50% when you build >>> the >>> + kernel). >>> + >>> + The kernel and modules will generate slightly more assembly (1 to >>> 2% >>> + increase on the .text sections). The vmlinux binary will be >>> + significantly smaller due to less relocations. >>> >>> To put 10% kernel overhead into perspective: enabling this option wipes out >>> about >>> 5-10 years worth of painstaking optimizations we've done to keep the kernel >>> fast >>> ... (!!) >> >> Note that 10% is the high-bound of a CPU intensive workload. > > The cost can be reduced by using -fno-plt these days but some work > might be required to make that work with the kernel. 
> > Where does that 10% estimate in the kernel config docs come from? I'd > be surprised if it really cost that much on x86_64. That's a realistic > cost for i386 with modern GCC (it used to be worse) but I'd expect > x86_64 to be closer to 2% even for CPU intensive workloads. It should > be very close to zero with -fno-plt. I got 8 to 10% on hackbench. Other benchmarks were 4% or lower. I will look at more recent compilers and -fno-plt as well. -- Thomas
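For readers unfamiliar with the PLT overhead being discussed: in a PIE, calls to externally visible functions normally route through the procedure linkage table so the dynamic linker can interpose them, and `-fno-plt` (or making symbols non-interposable) removes that indirection. A minimal, hypothetical C sketch of the non-interposable case — the function name is invented for illustration:

```c
#include <assert.h>

/* With default visibility in a PIE, a call to a global function may go
 * through a PLT stub.  Hidden visibility tells the compiler the symbol
 * cannot be interposed, so it can emit a direct call instead -- the
 * same effect -fno-plt / -Bsymbolic aim for.  A kernel has no dynamic
 * linker at all, so interposition support is pure overhead there. */
__attribute__((visibility("hidden")))
int add_one(int x)
{
    return x + 1;
}
```

Compiling a file like this with `gcc -fpie -O2 -S` and comparing against the default-visibility version is an easy way to see whether a `@PLT` call survives.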
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On 15 August 2017 at 10:20, Thomas Garnier wrote: > On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: >> >> * Thomas Garnier wrote: >> >>> > Do these changes get us closer to being able to build the kernel as truly >>> > position independent, i.e. to place it anywhere in the valid x86-64 >>> > address >>> > space? Or any other advantages? >>> >>> Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to >>> have a full randomized address space where position and order of sections >>> are >>> completely random. There is still some work to get there but being able to >>> build >>> a PIE kernel is a significant step. >> >> So I _really_ dislike the whole PIE approach, because of the huge slowdown: >> >> +config RANDOMIZE_BASE_LARGE >> + bool "Increase the randomization range of the kernel image" >> + depends on X86_64 && RANDOMIZE_BASE >> + select X86_PIE >> + select X86_MODULE_PLTS if MODULES >> + default n >> + ---help--- >> + Build the kernel as a Position Independent Executable (PIE) and >> + increase the available randomization range from 1GB to 3GB. >> + >> + This option impacts performance on kernel CPU intensive workloads >> up >> + to 10% due to PIE generated code. Impact on user-mode processes and >> + typical usage would be significantly less (0.50% when you build the >> + kernel). >> + >> + The kernel and modules will generate slightly more assembly (1 to >> 2% >> + increase on the .text sections). The vmlinux binary will be >> + significantly smaller due to less relocations. >> >> To put 10% kernel overhead into perspective: enabling this option wipes out >> about >> 5-10 years worth of painstaking optimizations we've done to keep the kernel >> fast >> ... (!!) > > Note that 10% is the high-bound of a CPU intensive workload. The cost can be reduced by using -fno-plt these days but some work might be required to make that work with the kernel. Where does that 10% estimate in the kernel config docs come from? 
I'd be surprised if it really cost that much on x86_64. That's a realistic cost for i386 with modern GCC (it used to be worse) but I'd expect x86_64 to be closer to 2% even for CPU intensive workloads. It should be very close to zero with -fno-plt.
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar wrote: > > * Thomas Garnier wrote: > >> > Do these changes get us closer to being able to build the kernel as truly >> > position independent, i.e. to place it anywhere in the valid x86-64 address >> > space? Or any other advantages? >> >> Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to >> have a full randomized address space where position and order of sections are >> completely random. There is still some work to get there but being able to >> build >> a PIE kernel is a significant step. > > So I _really_ dislike the whole PIE approach, because of the huge slowdown: > > +config RANDOMIZE_BASE_LARGE > + bool "Increase the randomization range of the kernel image" > + depends on X86_64 && RANDOMIZE_BASE > + select X86_PIE > + select X86_MODULE_PLTS if MODULES > + default n > + ---help--- > + Build the kernel as a Position Independent Executable (PIE) and > + increase the available randomization range from 1GB to 3GB. > + > + This option impacts performance on kernel CPU intensive workloads up > + to 10% due to PIE generated code. Impact on user-mode processes and > + typical usage would be significantly less (0.50% when you build the > + kernel). > + > + The kernel and modules will generate slightly more assembly (1 to 2% > + increase on the .text sections). The vmlinux binary will be > + significantly smaller due to less relocations. > > To put 10% kernel overhead into perspective: enabling this option wipes out > about > 5-10 years worth of painstaking optimizations we've done to keep the kernel > fast > ... (!!) Note that 10% is the high-bound of a CPU intensive workload. > > I think the fundamental flaw is the assumption that we need a PIE executable > to > have a freely relocatable kernel on 64-bit CPUs. > > Have you considered a kernel with -mcmodel=small (or medium) instead of -fpie > -mcmodel=large? 
We can pick a random 2GB window in the (non-kernel) canonical > x86-64 address space to randomize the location of kernel text. The location of > modules can be further randomized within that 2GB window. -mcmodel=small/medium assume you are in the low 32 bits. It generates instructions where the virtual addresses have the high 32-bit to be zero. I am going to start doing performance testing on -mcmodel=large to see if it is faster than -fPIE. > > It should have far less performance impact than the register-losing and > overhead-inducing -fpie / -mcmodel=large (for modules) execution models. > > My quick guess is that the performance impact might be close to zero in fact. If mcmodel=small/medium were possible for the kernel, I don't think it would have less performance impact than mcmodel=large. It would still need to set the high 32 bits to a static value; only the relocation would be a different size. > > Thanks, > > Ingo -- Thomas
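The constraint Thomas describes can be stated precisely: the small/medium code models let GCC refer to symbols with 32-bit absolute operands, and a 32-bit value used in a 64-bit context is zero- or sign-extended. A toy predicate (illustrative only) captures which addresses such an operand can name:

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit absolute operand can address either the low 2 GB of the
 * address space (zero-extended) or -- as -mcmodel=kernel exploits for
 * the top-2GB kernel mapping -- the sign-extended top 2 GB.  Anything
 * in between needs a 64-bit movabs or RIP-relative addressing, which
 * is exactly why small/medium cannot place the kernel at an arbitrary
 * canonical address. */
static int fits_32bit_operand(uint64_t addr)
{
    return addr < (1ULL << 31) ||            /* low 2 GB, zero-extended  */
           addr >= 0xFFFFFFFF80000000ULL;    /* top 2 GB, sign-extended  */
}
```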
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Thomas Garnier wrote: > > Do these changes get us closer to being able to build the kernel as truly > > position independent, i.e. to place it anywhere in the valid x86-64 address > > space? Or any other advantages? > > Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to > have a full randomized address space where position and order of sections are > completely random. There is still some work to get there but being able to > build > a PIE kernel is a significant step. So I _really_ dislike the whole PIE approach, because of the huge slowdown: +config RANDOMIZE_BASE_LARGE + bool "Increase the randomization range of the kernel image" + depends on X86_64 && RANDOMIZE_BASE + select X86_PIE + select X86_MODULE_PLTS if MODULES + default n + ---help--- + Build the kernel as a Position Independent Executable (PIE) and + increase the available randomization range from 1GB to 3GB. + + This option impacts performance on kernel CPU intensive workloads up + to 10% due to PIE generated code. Impact on user-mode processes and + typical usage would be significantly less (0.50% when you build the + kernel). + + The kernel and modules will generate slightly more assembly (1 to 2% + increase on the .text sections). The vmlinux binary will be + significantly smaller due to less relocations. To put 10% kernel overhead into perspective: enabling this option wipes out about 5-10 years worth of painstaking optimizations we've done to keep the kernel fast ... (!!) I think the fundamental flaw is the assumption that we need a PIE executable to have a freely relocatable kernel on 64-bit CPUs. Have you considered a kernel with -mcmodel=small (or medium) instead of -fpie -mcmodel=large? We can pick a random 2GB window in the (non-kernel) canonical x86-64 address space to randomize the location of kernel text. The location of modules can be further randomized within that 2GB window. 
It should have far less performance impact than the register-losing and overhead-inducing -fpie / -mcmodel=large (for modules) execution models. My quick guess is that the performance impact might be close to zero in fact. Thanks, Ingo
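A minimal sketch of the window-picking step Ingo proposes. Everything here is illustrative: the space size, the 2 MB alignment (the kernel's large-page granularity), and the helper name are assumptions for the sketch, not the real x86-64 memory map or any actual KASLR code:

```c
#include <assert.h>
#include <stdint.h>

/* Pick one random, 2 MB aligned base for a 2 GB window inside a toy
 * address space; the kernel would then be linked/relocated into that
 * window and keep ordinary rel32 direct calls within it. */
#define WINDOW_SIZE (1ULL << 31)   /* 2 GB randomization window      */
#define ALIGN_2M    (1ULL << 21)   /* large-page alignment of base   */
#define SPACE_BITS  46             /* toy size of the usable space   */

/* rnd is assumed to come from a boot-time entropy source */
static uint64_t pick_window_base(uint64_t rnd)
{
    uint64_t slots = ((1ULL << SPACE_BITS) - WINDOW_SIZE) / ALIGN_2M;
    return (rnd % slots) * ALIGN_2M;
}
```

The returned base is always aligned and the whole 2 GB window always fits inside the toy space, which is the invariant a real implementation would also have to maintain.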
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Fri, Aug 11, 2017 at 5:41 AM, Ingo Molnar wrote: > > * Thomas Garnier wrote: > >> Changes: >> - v2: >>- Add support for global stack cookie while compiler default to fs without >> mcmodel=kernel >>- Change patch 7 to correctly jump out of the identity mapping on kexec >> load >> preserve. >> >> These patches make the changes necessary to build the kernel as Position >> Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below >> the top 2G of the virtual address space. It allows to optionally extend the >> KASLR randomization range from 1G to 3G. > > So this: > > 61 files changed, 923 insertions(+), 299 deletions(-) > > ... is IMHO an _awful_ lot of churn and extra complexity in pretty fragile > pieces > of code, to gain what appears to be only ~1.5 more bits of randomization! The range increase is a way to use PIE right away. > > Do these changes get us closer to being able to build the kernel as truly > position > independent, i.e. to place it anywhere in the valid x86-64 address space? Or > any > other advantages? Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to have a full randomized address space where position and order of sections are completely random. There is still some work to get there but being able to build a PIE kernel is a significant step. > > Thanks, > > Ingo -- Thomas
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
* Thomas Garnier wrote: > Changes: > - v2: >- Add support for global stack cookie while compiler default to fs without > mcmodel=kernel >- Change patch 7 to correctly jump out of the identity mapping on kexec > load > preserve. > > These patches make the changes necessary to build the kernel as Position > Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below > the top 2G of the virtual address space. It allows to optionally extend the > KASLR randomization range from 1G to 3G. So this: 61 files changed, 923 insertions(+), 299 deletions(-) ... is IMHO an _awful_ lot of churn and extra complexity in pretty fragile pieces of code, to gain what appears to be only ~1.5 more bits of randomization! Do these changes get us closer to being able to build the kernel as truly position independent, i.e. to place it anywhere in the valid x86-64 address space? Or any other advantages? Thanks, Ingo
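Ingo's "~1.5 more bits" figure follows directly from the range growth: for a fixed randomization alignment, KASLR entropy is log2(range / alignment), so going from 1G to 3G adds log2(3) ≈ 1.58 bits regardless of the alignment chosen. A small sketch checks this; the bit-by-bit log2 helper is only there to keep the example free of libm:

```c
#include <assert.h>

/* log2 for x >= 1 via repeated squaring: accumulate the integer part,
 * then extract fractional bits one at a time. */
static double log2_ge1(double x)
{
    double n = 0.0;
    while (x >= 2.0) {              /* integer part */
        x /= 2.0;
        n += 1.0;
    }
    double step = 0.5;              /* fractional bits */
    for (int i = 0; i < 40; i++) {
        x *= x;
        if (x >= 2.0) {
            x /= 2.0;
            n += step;
        }
        step /= 2.0;
    }
    return n;
}

/* extra entropy bits gained by growing the randomization range */
static double extra_bits(double old_range, double new_range)
{
    return log2_ge1(new_range / old_range);
}
```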
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Tue, 18 Jul 2017, Thomas Garnier wrote: > Performance/Size impact: > Hackbench (50% and 1600% loads): > - PIE enabled: 7% to 8% on half load, 10% on heavy load. > slab_test (average of 10 runs): > - PIE enabled: 3% to 4% > Kernbench (average of 10 Half and Optimal runs): > - PIE enabled: 5% to 6% > > Size of vmlinux (Ubuntu configuration): > File size: > - PIE disabled: 472928672 bytes (-0.000169% from baseline) > - PIE enabled: 216878461 bytes (-54.14% from baseline) Maybe we need something like CONFIG_PARANOIA so that we can determine at build time how much performance we want to sacrifice for security? It's going to be difficult to understand what all these hardening config options do.
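As a sanity check on the size figures quoted above, the -54.14% does match the two raw byte counts, which supports the claim that PIE shrinks the on-disk image by emitting fewer relocations into it:

```c
#include <assert.h>

/* percentage change from `before` to `after`; negative means smaller */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the quoted values, `percent_change(472928672.0, 216878461.0)` lands at about -54.14, matching the posting.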