Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-10-20 Thread Ingo Molnar

* Pavel Machek  wrote:

> On Mon 2017-09-25 09:33:42, Ingo Molnar wrote:
> > 
> > * Pavel Machek  wrote:
> > 
> > > > For example, there would be collisions with regular user-space mappings, right?
> > > > Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> > > > the kernel lives?
> > > 
> > > Local unprivileged users can probably get your secret bits using cache
> > > probing and jump prediction buffers.
> > > 
> > > Yes, you don't want to leak the information using mmap(MAP_FIXED), but
> > > the CPU will leak it for you, anyway.
> > 
> > Depends on the CPU I think, and CPU vendors are busy trying to mitigate this
> > angle.
> 
> I believe any x86 CPU running Linux will leak it. And with CPU vendors
> putting "artificial intelligence" into branch prediction, no, I don't
> think it is going to get better.
> 
> That does not mean we should not prevent the mmap() info leak, but...

That might or might not be so, but there's a world of difference between
running a relatively long statistical attack to figure out the kernel's
location, versus being able to programmatically probe the kernel's location
using large MAP_FIXED user-space mmap()s, within a few dozen microseconds
and with a 100% guaranteed, non-statistical result.
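
To make the concern concrete, here is a minimal user-space sketch of such a
probe - purely illustrative, assuming a hypothetical layout where kernel text
could occupy otherwise user-mappable addresses (note that plain MAP_FIXED
clobbers any of the process's own mappings it hits; a careful probe would use
MAP_FIXED_NOREPLACE on newer kernels):

#include <stdio.h>
#include <sys/mman.h>

/* Try to place a tiny MAP_FIXED mapping at each candidate slot; if the
 * kernel (hypothetically) occupied user-visible addresses, the slots it
 * covers would be the ones where mmap() refuses the request. */
int main(void)
{
    const unsigned long step = 1UL << 30;          /* probe at 1 GiB granularity */

    for (unsigned long addr = step; addr < (1UL << 46); addr += step) {
        void *p = mmap((void *)addr, 4096, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED)
            printf("0x%lx: not mappable\n", addr);
        else
            munmap(p, 4096);
    }
    return 0;
}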

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-10-06 Thread Pavel Machek
On Mon 2017-09-25 09:33:42, Ingo Molnar wrote:
> 
> * Pavel Machek  wrote:
> 
> > > For example, there would be collisions with regular user-space mappings, right?
> > > Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> > > the kernel lives?
> > 
> > Local unprivileged users can probably get your secret bits using cache
> > probing and jump prediction buffers.
> > 
> > Yes, you don't want to leak the information using mmap(MAP_FIXED), but
> > the CPU will leak it for you, anyway.
> 
> Depends on the CPU I think, and CPU vendors are busy trying to mitigate this 
> angle.

I believe any x86 CPU running Linux will leak it. And with CPU vendors
putting "artificial intelligence" into branch prediction, no, I don't
think it is going to get better.

That does not mean we should not prevent the mmap() info leak, but...

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-10-02 Thread Thomas Garnier
On Sat, Sep 23, 2017 at 2:43 AM, Ingo Molnar  wrote:
>
> * Thomas Garnier  wrote:
>
>> >   2) we first implement the additional entropy bits that Linus suggested.
>> >
>> > does this work for you?
>>
>> Sure, I can look at how feasible that is. If it is, can I send
>> everything as part of the same patch set? The additional entropy would
>> be enabled for all KASLR but PIE will be off-by-default of course.
>
> Sure, can all be part of the same series.

I looked deeper into the change Linus proposed (moving the .text section
based on the cache line size). I think the complexity is too high for the
value of this change.

To move only the .text section would require at least the following changes:
 - An overall change in how relocations are processed: relocations inside
and outside of the .text section need to be handled separately.
 - Breaking assumptions about _text alignment while keeping size
calculations accurate (for example _end - _text).

With a rough attempt at this, I managed to pass early boot and still
crash later on.

This change would be valuable if you leak the address of a section
other than .text and you want to know where .text is; in other words, the
main bug that you are trying to exploit only allows you to execute code
(and you are trying to ROP in .text). I would argue that a better
mitigation for this type of bug is moving function pointers to
read-only sections and using stack cookies (for the return address). This
change won't prevent other types of attacks, like data corruption.

I think it would be more valuable to look at something like selfrando
/ pagerando [1] but maybe wait a bit for it to be more mature
(especially on the debugging side).

What do you think?

[1] http://lists.llvm.org/pipermail/llvm-dev/2017-June/113794.html

>
> Thanks,
>
> Ingo



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-25 Thread Ingo Molnar

* Pavel Machek  wrote:

> > For example, there would be collisions with regular user-space mappings, right?
> > Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> > the kernel lives?
> 
> Local unprivileged users can probably get your secret bits using cache
> probing and jump prediction buffers.
> 
> Yes, you don't want to leak the information using mmap(MAP_FIXED), but
> the CPU will leak it for you, anyway.

Depends on the CPU I think, and CPU vendors are busy trying to mitigate this 
angle.

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-24 Thread Pavel Machek
Hi!

> > We do need to consider how we want modules to fit into whatever model we
> > choose, though.  They can be adjacent, or we could go with a more
> > traditional dynamic link model where the modules can be separate, and
> > chained together with the main kernel via the GOT.
> 
> So I believe we should start with 'adjacent'. The thing is, having modules
> separately randomized mostly helps if any of the secret locations fails and
> we want to prevent hopping from one to the other. But if one of the
> kernel-privileged secret locations fails then KASLR has already failed to a
> significant degree...
> 
> So I think the large-PIC model for modules does not buy us any real advantages
> in practice, and the disadvantages of large-PIC are real, and most Linux users
> have to pay that cost unconditionally, as distro kernels have half of their
> kernel functionality living in modules.
> 
> But I do see fundamental value in being able to hide the kernel somewhere in a
> ~48-bit address space, especially if we also implement Linus's suggestion to
> utilize the lower bits as well. 0..281474976710656 is a nicely large range and
> will get larger with time.
> 
> But it should all be done smartly and carefully:
> 
> For example, there would be collisions with regular user-space mappings, right?
> Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
> the kernel lives?

Local unprivileged users can probably get your secret bits using
cache probing and jump prediction buffers.

Yes, you don't want to leak the information using mmap(MAP_FIXED), but
the CPU will leak it for you, anyway.

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-23 Thread Ingo Molnar

* H. Peter Anvin  wrote:

> We do need to consider how we want modules to fit into whatever model we
> choose, though.  They can be adjacent, or we could go with a more
> traditional dynamic link model where the modules can be separate, and
> chained together with the main kernel via the GOT.

So I believe we should start with 'adjacent'. The thing is, having modules
separately randomized mostly helps if any of the secret locations fails and
we want to prevent hopping from one to the other. But if one of the
kernel-privileged secret locations fails then KASLR has already failed to a
significant degree...

So I think the large-PIC model for modules does not buy us any real advantages
in practice, and the disadvantages of large-PIC are real, and most Linux users
have to pay that cost unconditionally, as distro kernels have half of their
kernel functionality living in modules.

But I do see fundamental value in being able to hide the kernel somewhere in a
~48-bit address space, especially if we also implement Linus's suggestion to
utilize the lower bits as well. 0..281474976710656 is a nicely large range and
will get larger with time.
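
(A trivial sanity check of that upper bound - it is exactly 2^48 - written
out in C just to make the arithmetic explicit:)

#include <stdio.h>

int main(void)
{
    /* 1ULL << 48 == 281474976710656: the ~48-bit range quoted above. */
    printf("%llu\n", 1ULL << 48);
    return 0;
}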

But it should all be done smartly and carefully:

For example, there would be collisions with regular user-space mappings, right?
Can local unprivileged users use mmap(MAP_FIXED) probing to figure out where
the kernel lives?

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-23 Thread Ingo Molnar

* H. Peter Anvin  wrote:

> On 09/22/17 09:32, Ingo Molnar wrote:
> > 
> > BTW., I think things improved with ORC because with ORC we have RBP as an
> > extra register and with PIE we lose RBX - so register pressure in code
> > generation is lower.
> > 
> 
> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
> has RIP-relative addressing there is no need for a dedicated PIC register.

Indeed, but we'd use a new register _a lot_ for constructs like this, transforming:

  mov    r9,QWORD PTR [r11*8-0x7e3da060]    (8 bytes)

into:

  lea    rbx,[rip+]                          (7 bytes)
  mov    r9,QWORD PTR [rbx+r11*8]            (6 bytes)

... which I suppose is quite close to (but not the same as) 'losing' RBX.

Of course the compiler can pick other registers as well, not that it matters much
to register pressure in larger functions in the end. Plus if the compiler has to
pick a callee-saved register there's the additional saving/restoring overhead of
that as well.

Right?

> I'm somewhat confused how we can have as much as almost 1% overhead.  I suspect
> that we end up making a GOT and maybe even a PLT for no good reason.

So the above transformation alone would explain a good chunk of the overhead I 
think.

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-23 Thread Ingo Molnar

* Thomas Garnier  wrote:

> >   2) we first implement the additional entropy bits that Linus suggested.
> >
> > does this work for you?
> 
> Sure, I can look at how feasible that is. If it is, can I send
> everything as part of the same patch set? The additional entropy would
> be enabled for all KASLR but PIE will be off-by-default of course.

Sure, can all be part of the same series.

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-23 Thread hjl . tools


On September 23, 2017 3:06:16 AM GMT+08:00, "H. Peter Anvin"  
wrote:
>On 09/22/17 11:57, Kees Cook wrote:
>> On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin 
>wrote:
>>> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since
>x86-64
>>> has RIP-relative addressing there is no need for a dedicated PIC
>register.
>> 
>> FWIW, since gcc 5, the PIC register isn't totally lost. It is now
>> reusable, and that seems to have improved performance:
>> https://gcc.gnu.org/gcc-5/changes.html
>
>It still talks about a PIC register on x86-64, which confuses me.
>Perhaps older gcc's would allocate a PIC register under certain
>circumstances, and then lose it for the entire function?
>
>For i386, the PIC register is required by the ABI to be %ebx at the
>point any PLT entry is called.  Not an issue with -mno-plt which goes
>straight to the GOT, although in most cases there needs to be a PIC
>register to find the GOT unless load-time relocation is permitted.
>
>   -hpa

We need a static PIE option so that the compiler can optimize it
without using hidden visibility.

H.J.
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread Thomas Garnier
On Thu, Sep 21, 2017 at 2:21 PM, Thomas Garnier  wrote:
> On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel
>  wrote:
>>
>> On 21 September 2017 at 08:59, Ingo Molnar  wrote:
>> >
>> > ( Sorry about the delay in answering this. I could blame the delay on the 
>> > merge
>> >   window, but in reality I've been procrastinating this is due to the 
>> > permanent,
>> >   non-trivial impact PIE has on generated C code. )
>> >
>> > * Thomas Garnier  wrote:
>> >
>> >> 1) PIE sometime needs two instructions to represent a single
>> >> instruction on mcmodel=kernel.
>> >
>> > What again is the typical frequency of this occurring in an x86-64 
>> > defconfig
>> > kernel, with the very latest GCC?
>> >
>> > Also, to make sure: which unwinder did you use for your measurements,
>> > frame-pointers or ORC? Please use ORC only for future numbers, as
>> > frame-pointers is obsolete from a performance measurement POV.
>> >
>> >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
>> >
>> > Hopefully this can either be fixed in GCC or at least influenced via a 
>> > compiler
>> > switch in the future.
>> >
>>
>> There are somewhat related concerns in the ARM world, so it would be
>> good if we could work with the GCC developers to get a more high level
>> and arch neutral command line option (-mkernel-pie? sounds yummy!)
>> that stops the compiler from making inferences that only hold for
>> shared libraries and/or other hosted executables (GOT indirections,
>> avoiding text relocations etc). That way, we will also be able to drop
>> the 'hidden' visibility override at some point, which we currently
>> need to prevent the compiler from redirecting all global symbol
>> references via entries in the GOT.
>
> My plan was to add a -mtls-reg= to switch the default segment
> register for stack cookies but I can see great benefits in having a
> more general kernel flag that would allow to get rid of the GOT and
> PLT when you are building position independent code for the kernel. It
> could also include optimizations like folding switch tables etc...
>
> Should we start a separate discussion on that? Anyone that would be
> more experienced than I to push that to gcc & clang upstream?

After separate discussion, opened:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

>
>>
>> All we really need is the ability to move the image around in virtual
>> memory, and things like reducing the CoW footprint or enabling ELF
>> symbol preemption are completely irrelevant for us.
>
>
>
>
> --
> Thomas



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread H. Peter Anvin
On 09/22/17 11:57, Kees Cook wrote:
> On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin  wrote:
>> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
>> has RIP-relative addressing there is no need for a dedicated PIC register.
> 
> FWIW, since gcc 5, the PIC register isn't totally lost. It is now
> reusable, and that seems to have improved performance:
> https://gcc.gnu.org/gcc-5/changes.html

It still talks about a PIC register on x86-64, which confuses me.
Perhaps older gcc's would allocate a PIC register under certain
circumstances, and then lose it for the entire function?

For i386, the PIC register is required by the ABI to be %ebx at the
point any PLT entry is called.  Not an issue with -mno-plt which goes
straight to the GOT, although in most cases there needs to be a PIC
register to find the GOT unless load-time relocation is permitted.

-hpa


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread H. Peter Anvin
On 09/22/17 09:32, Ingo Molnar wrote:
> 
> BTW., I think things improved with ORC because with ORC we have RBP as an
> extra register and with PIE we lose RBX - so register pressure in code
> generation is lower.
> 

We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
has RIP-relative addressing there is no need for a dedicated PIC register.

I'm somewhat confused how we can have as much as almost 1% overhead.  I
suspect that we end up making a GOT and maybe even a PLT for no good reason.

-hpa

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread Thomas Garnier
On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin  wrote:
> On 09/22/17 09:32, Ingo Molnar wrote:
>>
>> BTW., I think things improved with ORC because with ORC we have RBP as an
>> extra register and with PIE we lose RBX - so register pressure in code
>> generation is lower.
>>
>
> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
> has RIP-relative addressing there is no need for a dedicated PIC register.
>
> I'm somewhat confused how we can have as much as almost 1% overhead.  I
> suspect that we end up making a GOT and maybe even a PLT for no good reason.

We have a GOT with very few entries, mainly linker-script globals that
I think we can work to reduce or remove.

We have a PLT but it is empty. On the latest iteration (not sent yet),
modules have PLT32 relocations but no PLT entries. I got rid of
mcmodel=large for modules; instead, I move the beginning of the
module section to just after the kernel so that relative relocations work.
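
The layout constraint behind that is simple: every module has to stay within
signed 32-bit reach of the kernel image so that PC-relative relocations
resolve without PLT entries. A minimal sketch of the check (names are
illustrative, not from the patch set):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative check: can a 32-bit PC-relative relocation at 'from' reach
 * 'to'?  Both the kernel and its modules must stay inside one such
 * +/-2 GiB window for the scheme described above to work. */
static bool in_s32_range(uintptr_t from, uintptr_t to)
{
    intptr_t delta = (intptr_t)to - (intptr_t)from;

    return delta >= INT32_MIN && delta <= INT32_MAX;
}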

>
> -hpa



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread Kees Cook
On Fri, Sep 22, 2017 at 11:38 AM, H. Peter Anvin  wrote:
> We lose EBX on 32 bits, but we don't lose RBX on 64 bits - since x86-64
> has RIP-relative addressing there is no need for a dedicated PIC register.

FWIW, since gcc 5, the PIC register isn't totally lost. It is now
reusable, and that seems to have improved performance:
https://gcc.gnu.org/gcc-5/changes.html

-Kees

-- 
Kees Cook
Pixel Security

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread H. Peter Anvin
On 08/21/17 07:28, Peter Zijlstra wrote:
> 
> Ah, I see, this is large mode and that needs to use MOVABS to load 64-bit
> immediates. Still, small RIP-relative should be able to live at any
> point as long as everything lives inside the same 2G relative range, so
> it would still allow the goal of increasing the KASLR range.
> 
> So I'm not seeing how we need large mode for that. That said, after
> reading up on all this, RIP-relative will not be too pretty either:
> while CALL is naturally RIP-relative, data still needs an explicit %rip
> offset. Still, it's loads better than the large model.
> 

The large model makes no sense whatsoever.  I think what we're actually
looking for is the small-PIC model.

Ingo asked:
> I.e. is there no GCC code generation mode where code can be placed anywhere in
> the canonical address space, yet call and jump distance is within 31 bits so
> that the generated code is fast?

That's the small-PIC model.  I think if all symbols are forced to hidden
visibility then it won't even need a GOT/PLT.
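
For reference, "forcing hidden" can be done with a visibility pragma or
-fvisibility=hidden; with hidden visibility the compiler knows the symbol
cannot be preempted and emits direct RIP-relative references instead of
GOT/PLT indirections. A minimal, hypothetical illustration (symbol names
made up):

/* With hidden visibility, PIC/PIE code references the symbol directly
 * (RIP-relative) instead of loading its address from the GOT. */
#pragma GCC visibility push(hidden)
extern int important_counter;          /* illustrative symbol name */
int read_counter(void);
#pragma GCC visibility pop

int read_counter(void)
{
    return important_counter;          /* direct RIP-relative load */
}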

We do need to consider how we want modules to fit into whatever model we
choose, though.  They can be adjacent, or we could go with a more
traditional dynamic link model where the modules can be separate, and
chained together with the main kernel via the GOT.

-hpa

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread Thomas Garnier
On Fri, Sep 22, 2017 at 9:32 AM, Ingo Molnar  wrote:
>
> * Thomas Garnier  wrote:
>
>> On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar  wrote:
>> >
>> > ( Sorry about the delay in answering this. I could blame the delay on the 
>> > merge
>> >   window, but in reality I've been procrastinating this is due to the 
>> > permanent,
>> >   non-trivial impact PIE has on generated C code. )
>> >
>> > * Thomas Garnier  wrote:
>> >
>> >> 1) PIE sometime needs two instructions to represent a single
>> >> instruction on mcmodel=kernel.
>> >
>> > What again is the typical frequency of this occurring in an x86-64 
>> > defconfig
>> > kernel, with the very latest GCC?
>>
>> I am not sure what is the best way to measure that.
>
> If this is the dominant factor then 'sizeof vmlinux' ought to be enough:
>
>> With ORC: PIE .text is 0.814224% larger than baseline
>
> I.e. the overhead is +0.81% in both size and (roughly) in the number of
> instructions executed.
>
> BTW., I think things improved with ORC because with ORC we have RBP as an
> extra register and with PIE we lose RBX - so register pressure in code
> generation is lower.

That makes sense.

>
> Ok, I suspect we can try it, but my preconditions for merging it would be:
>
>   1) Linus doesn't NAK it (obviously)

Of course.

>   2) we first implement the additional entropy bits that Linus suggested.
>
> does this work for you?

Sure, I can look at how feasible that is. If it is, can I send
everything as part of the same patch set? The additional entropy would
be enabled for all KASLR but PIE will be off-by-default of course.

>
> Thanks,
>
> Ingo



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread Ingo Molnar

* Thomas Garnier  wrote:

> On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar  wrote:
> >
> > ( Sorry about the delay in answering this. I could blame the delay on the 
> > merge
> >   window, but in reality I've been procrastinating this is due to the 
> > permanent,
> >   non-trivial impact PIE has on generated C code. )
> >
> > * Thomas Garnier  wrote:
> >
> >> 1) PIE sometime needs two instructions to represent a single
> >> instruction on mcmodel=kernel.
> >
> > What again is the typical frequency of this occurring in an x86-64 defconfig
> > kernel, with the very latest GCC?
> 
> I am not sure what is the best way to measure that.

If this is the dominant factor then 'sizeof vmlinux' ought to be enough:

> With ORC: PIE .text is 0.814224% larger than baseline

I.e. the overhead is +0.81% in both size and (roughly) in the number of
instructions executed.

BTW., I think things improved with ORC because with ORC we have RBP as an extra 
register and with PIE we lose RBX - so register pressure in code generation is 
lower.

Ok, I suspect we can try it, but my preconditions for merging it would be:

  1) Linus doesn't NAK it (obviously)
  2) we first implement the additional entropy bits that Linus suggested.

does this work for you?

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-22 Thread Thomas Garnier
On Thu, Sep 21, 2017 at 9:24 PM, Markus Trippelsdorf
 wrote:
> On 2017.09.21 at 14:21 -0700, Thomas Garnier wrote:
>> On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel
>>  wrote:
>> >
>> > On 21 September 2017 at 08:59, Ingo Molnar  wrote:
>> > >
>> > > ( Sorry about the delay in answering this. I could blame the delay on 
>> > > the merge
>> > >   window, but in reality I've been procrastinating this is due to the 
>> > > permanent,
>> > >   non-trivial impact PIE has on generated C code. )
>> > >
>> > > * Thomas Garnier  wrote:
>> > >
>> > >> 1) PIE sometime needs two instructions to represent a single
>> > >> instruction on mcmodel=kernel.
>> > >
>> > > What again is the typical frequency of this occurring in an x86-64 
>> > > defconfig
>> > > kernel, with the very latest GCC?
>> > >
>> > > Also, to make sure: which unwinder did you use for your measurements,
>> > > frame-pointers or ORC? Please use ORC only for future numbers, as
>> > > frame-pointers is obsolete from a performance measurement POV.
>> > >
>> > >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
>> > >
>> > > Hopefully this can either be fixed in GCC or at least influenced via a 
>> > > compiler
>> > > switch in the future.
>> > >
>> >
>> > There are somewhat related concerns in the ARM world, so it would be
>> > good if we could work with the GCC developers to get a more high level
>> > and arch neutral command line option (-mkernel-pie? sounds yummy!)
>> > that stops the compiler from making inferences that only hold for
>> > shared libraries and/or other hosted executables (GOT indirections,
>> > avoiding text relocations etc). That way, we will also be able to drop
>> > the 'hidden' visibility override at some point, which we currently
>> > need to prevent the compiler from redirecting all global symbol
>> > references via entries in the GOT.
>>
>> My plan was to add a -mtls-reg= to switch the default segment
>> register for stack cookies but I can see great benefits in having a
>> more general kernel flag that would allow to get rid of the GOT and
>> PLT when you are building position independent code for the kernel. It
>> could also include optimizations like folding switch tables etc...
>>
>> Should we start a separate discussion on that? Anyone that would be
>> more experienced than I to push that to gcc & clang upstream?
>
> Just open a gcc bug. See
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81708 as an example.

Makes sense, I will look into this. Thanks, Andy, for the stack cookie bug!

>
> --
> Markus



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-21 Thread Markus Trippelsdorf
On 2017.09.21 at 14:21 -0700, Thomas Garnier wrote:
> On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel
>  wrote:
> >
> > On 21 September 2017 at 08:59, Ingo Molnar  wrote:
> > >
> > > ( Sorry about the delay in answering this. I could blame the delay on the 
> > > merge
> > >   window, but in reality I've been procrastinating this is due to the 
> > > permanent,
> > >   non-trivial impact PIE has on generated C code. )
> > >
> > > * Thomas Garnier  wrote:
> > >
> > >> 1) PIE sometime needs two instructions to represent a single
> > >> instruction on mcmodel=kernel.
> > >
> > > What again is the typical frequency of this occurring in an x86-64 
> > > defconfig
> > > kernel, with the very latest GCC?
> > >
> > > Also, to make sure: which unwinder did you use for your measurements,
> > > frame-pointers or ORC? Please use ORC only for future numbers, as
> > > frame-pointers is obsolete from a performance measurement POV.
> > >
> > >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
> > >
> > > Hopefully this can either be fixed in GCC or at least influenced via a 
> > > compiler
> > > switch in the future.
> > >
> >
> > There are somewhat related concerns in the ARM world, so it would be
> > good if we could work with the GCC developers to get a more high level
> > and arch neutral command line option (-mkernel-pie? sounds yummy!)
> > that stops the compiler from making inferences that only hold for
> > shared libraries and/or other hosted executables (GOT indirections,
> > avoiding text relocations etc). That way, we will also be able to drop
> > the 'hidden' visibility override at some point, which we currently
> > need to prevent the compiler from redirecting all global symbol
> > references via entries in the GOT.
> 
> My plan was to add a -mtls-reg= to switch the default segment
> register for stack cookies but I can see great benefits in having a
> more general kernel flag that would allow to get rid of the GOT and
> PLT when you are building position independent code for the kernel. It
> could also include optimizations like folding switch tables etc...
> 
> Should we start a separate discussion on that? Anyone that would be
> more experienced than I to push that to gcc & clang upstream?

Just open a gcc bug. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81708 as an example.

-- 
Markus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-21 Thread Thomas Garnier
On Thu, Sep 21, 2017 at 2:16 PM, Thomas Garnier  wrote:
>
> On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar  wrote:
> >
> > ( Sorry about the delay in answering this. I could blame the delay on the 
> > merge
> >   window, but in reality I've been procrastinating this is due to the 
> > permanent,
> >   non-trivial impact PIE has on generated C code. )
> >
> > * Thomas Garnier  wrote:
> >
> >> 1) PIE sometime needs two instructions to represent a single
> >> instruction on mcmodel=kernel.
> >
> > What again is the typical frequency of this occurring in an x86-64 defconfig
> > kernel, with the very latest GCC?
>
> I am not sure what is the best way to measure that.

A very approximate approach would be to look at each instruction using
the sign-extension trick with a _32S relocation. Not all _32S relocations
translate into more instructions, because some just relocate part of an
absolute mov which would actually be smaller as a relative access.

I used this command to get a relative estimate:

objdump -dr ./baseline/vmlinux | egrep -A 2 '\-0x[0-9a-f]{8}' | grep _32S | wc -l

That found 6130 places; if you assume each adds at least 7 bytes, that is at
least 42910 extra bytes in the .text section. The .text section is 78599
bytes bigger from baseline to PIE, so that's at least 54% of the size
difference - assuming we found all of them, and without factoring in the
impact of using an additional register.

I took a similar approach with the switch tables, but it is a bit more complex:

1) Find all constructs with an lea (%rip-relative) followed by a jmp
instruction inside a function (a typical unfolded switch case).
2) Discard destination addresses with fewer than 4 occurrences.

Result: 480 switch cases in 49 functions. Each case takes at least 9
bytes and the switch itself takes 16 bytes (assuming one per
function).

That's 5104 bytes for the easy-to-identify switches (less than 7% of the increase).
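
Sanity-checking the arithmetic above, using only the numbers already quoted
in this message:

#include <stdio.h>

int main(void)
{
    int extra_movs   = 6130 * 7;             /* = 42910 bytes          */
    int switch_bytes = 480 * 9 + 49 * 16;    /* = 5104 bytes           */
    int text_growth  = 78599;                /* .text delta, baseline -> PIE */

    printf("movs:     %d bytes (%.1f%% of the growth)\n",
           extra_movs, 100.0 * extra_movs / text_growth);
    printf("switches: %d bytes (%.1f%% of the growth)\n",
           switch_bytes, 100.0 * switch_bytes / text_growth);
    return 0;
}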

I am certainly missing a lot of differences. I checked whether the percpu
changes impacted the size and they don't (only 3 bytes added on PIE).

I also tried different ways to compare the .text section, like the size of
symbols or the number of bytes in a full disassembly, but the results are
really off from the whole .text size, so I am not sure they are the right
way to go about it.

>
> >
> > Also, to make sure: which unwinder did you use for your measurements,
> > frame-pointers or ORC? Please use ORC only for future numbers, as
> > frame-pointers is obsolete from a performance measurement POV.
>
> I used the default configuration which uses frame-pointer. I built all
> the different binaries with ORC and I see an improvement in size:
>
> On latest revision (just built and ran performance tests this week):
>
> With frame pointers: PIE .text is 0.837324% larger than baseline
>
> With ORC: PIE .text is 0.814224% larger than baseline
>
> Comparing baselines only, ORC is 2.849832% smaller than frame-pointers.
>
> >
> >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
> >
> > Hopefully this can either be fixed in GCC or at least influenced via a 
> > compiler
> > switch in the future.
> >
> >> The switches are the biggest increase on small functions but I don't
> >> think they represent a large portion of the difference (number 1 is).
> >
> > Ok.
> >
> >> A side note, while testing gcc 7.2.0 on hackbench I have seen the PIE
> >> kernel being faster by 1% across multiple runs (comparing 50 runs done
> >> across 5 reboots twice). I don't think PIE is faster than a
> >> mcmodel=kernel but recent versions of gcc makes them fairly similar.
> >
> > So I think we are down to an overhead range where the inherent noise (both 
> > random
> > and systematic one) in 'hackbench' overwhelms the signal we are trying to 
> > measure.
> >
> > So I think it's the kernel .text size change that is the best noise-free 
> > proxy for
> > the overhead impact of PIE.
>
> I agree but it might be hard to measure the exact impact. What is
> acceptable and what is not?
>
> >
> > It doesn't hurt to double check actual real performance as well, just don't 
> > expect
> > there to be much of a signal for anything but fully cached microbenchmark
> > workloads.
>
> That's aligned with what I see in the latest performance testing.
> Performance is close enough that it is hard to get exact numbers (PIE
> is just a bit slower than baseline on hackbench (~1%)).
>
> >
> > Thanks,
> >
> > Ingo
>
>
>
> --
> Thomas




-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-21 Thread Thomas Garnier
On Thu, Sep 21, 2017 at 9:10 AM, Ard Biesheuvel
 wrote:
>
> On 21 September 2017 at 08:59, Ingo Molnar  wrote:
> >
> > ( Sorry about the delay in answering this. I could blame the delay on the 
> > merge
> >   window, but in reality I've been procrastinating this is due to the 
> > permanent,
> >   non-trivial impact PIE has on generated C code. )
> >
> > * Thomas Garnier  wrote:
> >
> >> 1) PIE sometime needs two instructions to represent a single
> >> instruction on mcmodel=kernel.
> >
> > What again is the typical frequency of this occurring in an x86-64 defconfig
> > kernel, with the very latest GCC?
> >
> > Also, to make sure: which unwinder did you use for your measurements,
> > frame-pointers or ORC? Please use ORC only for future numbers, as
> > frame-pointers is obsolete from a performance measurement POV.
> >
> >> 2) GCC does not optimize switches in PIE in order to reduce relocations:
> >
> > Hopefully this can either be fixed in GCC or at least influenced via a 
> > compiler
> > switch in the future.
> >
>
> There are somewhat related concerns in the ARM world, so it would be
> good if we could work with the GCC developers to get a more high level
> and arch neutral command line option (-mkernel-pie? sounds yummy!)
> that stops the compiler from making inferences that only hold for
> shared libraries and/or other hosted executables (GOT indirections,
> avoiding text relocations etc). That way, we will also be able to drop
> the 'hidden' visibility override at some point, which we currently
> need to prevent the compiler from redirecting all global symbol
> references via entries in the GOT.

My plan was to add a -mtls-reg= to switch the default segment
register for stack cookies but I can see great benefits in having a
more general kernel flag that would allow to get rid of the GOT and
PLT when you are building position independent code for the kernel. It
could also include optimizations like folding switch tables etc...
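
For context, the reason a flag is needed at all is that gcc hard-codes where
the stack-protector canary lives. A rough user-space illustration
(assumptions: x86-64 user space compiled with -fstack-protector-strong; this
is not kernel code):

/* Compile with -fstack-protector-strong and look at the disassembly: the
 * prologue loads the canary from the TLS segment (on x86-64 user space,
 * "mov rax,QWORD PTR fs:0x28") and the epilogue re-checks it.  The kernel
 * keeps its per-CPU data behind %gs, which is why it needs a way to
 * retarget that hard-coded segment/offset. */
#include <string.h>

void copy_name(char *dst, const char *src)
{
    char buf[64];                  /* local array triggers instrumentation */

    strncpy(buf, src, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    strcpy(dst, buf);
}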

Should we start a separate discussion on that? Anyone that would be
more experienced than I to push that to gcc & clang upstream?

>
> All we really need is the ability to move the image around in virtual
> memory, and things like reducing the CoW footprint or enabling ELF
> symbol preemption are completely irrelevant for us.




-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-21 Thread Thomas Garnier
On Thu, Sep 21, 2017 at 8:59 AM, Ingo Molnar  wrote:
>
> ( Sorry about the delay in answering this. I could blame the delay on the 
> merge
>   window, but in reality I've been procrastinating this is due to the 
> permanent,
>   non-trivial impact PIE has on generated C code. )
>
> * Thomas Garnier  wrote:
>
>> 1) PIE sometime needs two instructions to represent a single
>> instruction on mcmodel=kernel.
>
> What again is the typical frequency of this occurring in an x86-64 defconfig
> kernel, with the very latest GCC?

I am not sure what is the best way to measure that.

>
> Also, to make sure: which unwinder did you use for your measurements,
> frame-pointers or ORC? Please use ORC only for future numbers, as
> frame-pointers is obsolete from a performance measurement POV.

I used the default configuration which uses frame-pointer. I built all
the different binaries with ORC and I see an improvement in size:

On latest revision (just built and ran performance tests this week):

With frame pointers: PIE .text is 0.837324% larger than baseline

With ORC: PIE .text is 0.814224% larger than baseline

Comparing baselines only, ORC is 2.849832% smaller than frame-pointers.

>
>> 2) GCC does not optimize switches in PIE in order to reduce relocations:
>
> Hopefully this can either be fixed in GCC or at least influenced via a 
> compiler
> switch in the future.
>
>> The switches are the biggest increase on small functions but I don't
>> think they represent a large portion of the difference (number 1 is).
>
> Ok.
>
>> A side note, while testing gcc 7.2.0 on hackbench I have seen the PIE
>> kernel being faster by 1% across multiple runs (comparing 50 runs done
> across 5 reboots twice). I don't think PIE is faster than
> mcmodel=kernel, but recent versions of gcc make them fairly similar.
>
> So I think we are down to an overhead range where the inherent noise (both 
> random
> and systematic one) in 'hackbench' overwhelms the signal we are trying to 
> measure.
>
> So I think it's the kernel .text size change that is the best noise-free 
> proxy for
> the overhead impact of PIE.

I agree but it might be hard to measure the exact impact. What is
acceptable and what is not?

>
> It doesn't hurt to double check actual real performance as well, just don't 
> expect
> there to be much of a signal for anything but fully cached microbenchmark
> workloads.

That's aligned with what I see in the latest performance testing.
Performance is close enough that it is hard to get exact numbers (PIE
is just a bit slower than baseline on hackbench (~1%)).

>
> Thanks,
>
> Ingo



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-21 Thread Ard Biesheuvel
On 21 September 2017 at 08:59, Ingo Molnar  wrote:
>
> ( Sorry about the delay in answering this. I could blame the delay on the 
> merge
>   window, but in reality I've been procrastinating this is due to the 
> permanent,
>   non-trivial impact PIE has on generated C code. )
>
> * Thomas Garnier  wrote:
>
>> 1) PIE sometime needs two instructions to represent a single
>> instruction on mcmodel=kernel.
>
> What again is the typical frequency of this occurring in an x86-64 defconfig
> kernel, with the very latest GCC?
>
> Also, to make sure: which unwinder did you use for your measurements,
> frame-pointers or ORC? Please use ORC only for future numbers, as
> frame-pointers is obsolete from a performance measurement POV.
>
>> 2) GCC does not optimize switches in PIE in order to reduce relocations:
>
> Hopefully this can either be fixed in GCC or at least influenced via a 
> compiler
> switch in the future.
>

There are somewhat related concerns in the ARM world, so it would be
good if we could work with the GCC developers to get a more high level
and arch neutral command line option (-mkernel-pie? sounds yummy!)
that stops the compiler from making inferences that only hold for
shared libraries and/or other hosted executables (GOT indirections,
avoiding text relocations etc). That way, we will also be able to drop
the 'hidden' visibility override at some point, which we currently
need to prevent the compiler from redirecting all global symbol
references via entries in the GOT.

All we really need is the ability to move the image around in virtual
memory, and things like reducing the CoW footprint or enabling ELF
symbol preemption are completely irrelevant for us.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-09-21 Thread Ingo Molnar

( Sorry about the delay in answering this. I could blame the delay on the merge
  window, but in reality I've been procrastinating on this, due to the permanent,
  non-trivial impact PIE has on generated C code. )

* Thomas Garnier  wrote:

> 1) PIE sometimes needs two instructions to represent a single
> instruction on mcmodel=kernel.

What again is the typical frequency of this occurring in an x86-64 defconfig 
kernel, with the very latest GCC?

Also, to make sure: which unwinder did you use for your measurements, 
frame-pointers or ORC? Please use ORC only for future numbers, as
frame-pointers is obsolete from a performance measurement POV.

> 2) GCC does not optimize switches in PIE in order to reduce relocations:

Hopefully this can either be fixed in GCC or at least influenced via a compiler 
switch in the future.

> The switches are the biggest increase on small functions but I don't
> think they represent a large portion of the difference (number 1 is).

Ok.

> A side note, while testing gcc 7.2.0 on hackbench I have seen the PIE
> kernel being faster by 1% across multiple runs (comparing 50 runs done
> across 5 reboots twice). I don't think PIE is faster than
> mcmodel=kernel, but recent versions of gcc make them fairly similar.

So I think we are down to an overhead range where the inherent noise (both random
and systematic) in 'hackbench' overwhelms the signal we are trying to measure.

So I think it's the kernel .text size change that is the best noise-free proxy for
the overhead impact of PIE.

It doesn't hurt to double check actual real performance as well, just don't expect
there to be much of a signal for anything but fully cached microbenchmark
workloads.

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-29 Thread Thomas Garnier
On Fri, Aug 25, 2017 at 8:05 AM, Thomas Garnier  wrote:
> On Fri, Aug 25, 2017 at 1:04 AM, Ingo Molnar  wrote:
>>
>> * Thomas Garnier  wrote:
>>
>>> With the fix for function tracing, the hackbench results have an
>>> average of +0.8 to +1.4% (from +8% to +10% before). With a default
>>> configuration, the numbers are closer to 0.8%.
>>>
>>> On the .text size, with gcc 4.9 I see +0.8% on default configuration
>>> and +1.180% on the ubuntu configuration.
>>
>> A 1% text size increase is still significant. Could you look at the 
>> disassembly,
>> where does the size increase come from?
>
> I will take a look. In this current iteration I added the .got and
> .got.plt sections, so removing them will remove a bit (even if they are
> small, we don't use them, to increase perf).
>
> What do you think about the perf numbers in general so far?

I looked at the size increase. I could identify two common cases:

1) PIE sometimes needs two instructions to represent a single
instruction on mcmodel=kernel.

For example, this instruction plays on the sign extension (mcmodel=kernel):

mov    r9,QWORD PTR [r11*8-0x7e3da060]    (8 bytes)

The address 0xffffffff81c25fa0 can be represented as -0x7e3da060 using
a 32S relocation.

with PIE:

lea    rbx,[rip+]                          (7 bytes)
mov    r9,QWORD PTR [rbx+r11*8]            (6 bytes)
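
The kind of C that produces this pattern is just an indexed load from a
global table - a hedged sketch, with a made-up symbol name (not the actual
site behind the instruction above):

extern unsigned long lookup_table[];       /* hypothetical global array */

unsigned long fetch(unsigned long idx)
{
    /* mcmodel=kernel: one mov with a sign-extended 32-bit displacement.
     * PIE: the table base is first materialized RIP-relatively (lea),
     * then the indexed load uses a second register. */
    return lookup_table[idx];
}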

2) GCC does not optimize switches in PIE in order to reduce relocations:

For example the switch in phy_modes [1]:

static inline const char *phy_modes(phy_interface_t interface)
{
	switch (interface) {
	case PHY_INTERFACE_MODE_NA:
		return "";
	case PHY_INTERFACE_MODE_INTERNAL:
		return "internal";
	case PHY_INTERFACE_MODE_MII:
		return "mii";

Without PIE (gcc 7.2.0), the whole table is optimized into one instruction:

   0x0040045b <+27>:    mov    rdi,QWORD PTR [rax*8+0x400660]

With PIE (gcc 7.2.0):

   0x0641 <+33>:    movsxd rax,DWORD PTR [rdx+rax*4]
   0x0645 <+37>:    add    rax,rdx
   0x0648 <+40>:    jmp    rax

   0x065d <+61>:    lea    rdi,[rip+0x264]        # 0x8c8
   0x0664 <+68>:    jmp    0x651
   0x0666 <+70>:    lea    rdi,[rip+0x2bc]        # 0x929
   0x066d <+77>:    jmp    0x651
   0x066f <+79>:    lea    rdi,[rip+0x2a8]        # 0x91e
   0x0676 <+86>:    jmp    0x651
   0x0678 <+88>:    lea    rdi,[rip+0x294]        # 0x913
   0x067f <+95>:    jmp    0x651

That's a deliberate choice; clang is able to optimize it (clang-3.8):

   0x0963 <+19>:    lea    rcx,[rip+0x200406]        # 0x200d70
   0x096a <+26>:    mov    rdi,QWORD PTR [rcx+rax*8]

I checked gcc, and the code that decides whether to fold the switch
basically does not do it for PIC, in order to reduce relocations [2].

The switches are the biggest increase on small functions but I don't
think they represent a large portion of the difference (number 1 is).

A side note, while testing gcc 7.2.0 on hackbench I have seen the PIE
kernel being faster by 1% across multiple runs (comparing 50 runs done
across 5 reboots twice). I don't think PIE is faster than
mcmodel=kernel, but recent versions of gcc make them fairly similar.

[1] http://elixir.free-electrons.com/linux/v4.13-rc7/source/include/linux/phy.h#L113
[2] https://github.com/gcc-mirror/gcc/blob/7977b0509f07e42fbe0f06efcdead2b7e4a5135f/gcc/tree-switch-conversion.c#L828

>
>>
>> Thanks,
>>
>> Ingo
>
>
>
> --
> Thomas



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-27 Thread H. Peter Anvin
On 08/21/17 07:31, Peter Zijlstra wrote:
> On Tue, Aug 15, 2017 at 07:20:38AM -0700, Thomas Garnier wrote:
>> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:
> 
>>> Have you considered a kernel with -mcmodel=small (or medium) instead of 
>>> -fpie
>>> -mcmodel=large? We can pick a random 2GB window in the (non-kernel) 
>>> canonical
>>> x86-64 address space to randomize the location of kernel text. The location 
>>> of
>>> modules can be further randomized within that 2GB window.
>>
>> -mcmodel=small/medium assumes you are in the low 32 bits of the address
>> space. It generates instructions where the high 32 bits of the virtual
>> addresses are zero.
> 
> That's a compiler fail, right? Because the SDM states that for "CALL
> rel32" the 32bit displacement is sign extended on x86_64.
> 

No.  It is about whether you can do something like:

movl $variable, %eax        /* rax =  */

or

addl %ecx,variable(,%rsi,4) /* variable[rsi] += ecx */
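
Roughly the C those two lines correspond to - a hedged sketch; the point is
that both encode the symbol's absolute address as a 32-bit
immediate/displacement, which only works if the linker can guarantee the
symbol lands in a zero- or sign-extendable 32-bit range (what
mcmodel=small/kernel assume, and what PIE gives up):

extern int variable[];                    /* illustrative global */

unsigned long take_address(void)
{
    return (unsigned long)variable;       /* movl $variable,%eax : 32-bit immediate */
}

void bump(unsigned long rsi, int ecx)
{
    variable[rsi] += ecx;                 /* addl %ecx,variable(,%rsi,4) */
}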

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-25 Thread Thomas Garnier
On Thu, Aug 24, 2017 at 2:42 PM, Linus Torvalds
 wrote:
>
> On Thu, Aug 24, 2017 at 2:13 PM, Thomas Garnier  wrote:
> >
> > My original performance testing was done with an Ubuntu generic
> > configuration. This configuration has the CONFIG_FUNCTION_TRACER
> > option which was incompatible with PIE. The tracer failed to replace
> > the __fentry__ call by a nop slide on each traceable function because
> > the instruction was not the one expected. If PIE is enabled, gcc
> > generates a different call instruction based on the GOT without
> > checking the visibility options (basically call *__fentry__@GOTPCREL).
>
> Gah.
>
> Don't we actually have *more* address bits for randomization at the
> low end, rather than getting rid of -mcmodel=kernel?

We have, but I think we use most of it for potential modules and the
fixmap, and it is not that big. The increase in range from 1G to 3G is
just an example and a way to ensure PIE works as expected. The long-term
goal is being able to put the kernel wherever we want in memory,
randomizing the position and the order of almost all memory sections.

That would be valuable against BTB attacks [1], for example, where
randomization of the low 32 bits is ineffective.

[1] https://github.com/felixwilhelm/mario_baslr

>
> Has anybody looked at just moving kernel text by smaller values than
> the page size? Yeah, yeah, the kernel has several sections that need
> page alignment, but I think we could relocate normal text by just the
> cacheline size, and that sounds like it would give several bits of
> randomness with little downside.

I didn't look into it. There is value in it, depending on the performance
impact. I think both PIE and finer-grained randomization would be
useful.

>
> Or has somebody already looked at it and I just missed it?
>
>Linus




-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-25 Thread Thomas Garnier
On Fri, Aug 25, 2017 at 1:04 AM, Ingo Molnar  wrote:
>
> * Thomas Garnier  wrote:
>
>> With the fix for function tracing, the hackbench results have an
>> average of +0.8 to +1.4% (from +8% to +10% before). With a default
>> configuration, the numbers are closer to 0.8%.
>>
>> On the .text size, with gcc 4.9 I see +0.8% on default configuration
>> and +1.180% on the ubuntu configuration.
>
> A 1% text size increase is still significant. Could you look at the 
> disassembly,
> where does the size increase come from?

I will take a look. In this current iteration I added the .got and
.got.plt sections, so removing them will remove a bit (even if they are
small, we don't use them, to increase perf).

What do you think about the perf numbers in general so far?

>
> Thanks,
>
> Ingo



-- 
Thomas

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-25 Thread Ingo Molnar

* Thomas Garnier  wrote:

> With the fix for function tracing, the hackbench results have an
> average of +0.8 to +1.4% (from +8% to +10% before). With a default
> configuration, the numbers are closer to 0.8%.
> 
> On the .text size, with gcc 4.9 I see +0.8% on default configuration
> and +1.180% on the ubuntu configuration.

A 1% text size increase is still significant. Could you look at the 
disassembly, 
where does the size increase come from?

Thanks,

Ingo

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-24 Thread Steven Rostedt
On Thu, 24 Aug 2017 14:13:38 -0700
Thomas Garnier  wrote:

> With the fix for function tracing, the hackbench results have an
> average of +0.8 to +1.4% (from +8% to +10% before). With a default
> configuration, the numbers are closer to 0.8%.

Wow, an empty fentry function not "nop"ed out only added 8% to 10%
overhead. I never did the benchmarks of that since I did it before
fentry was introduced, which was with the old "mcount". That gave an
average of 13% overhead in hackbench.

-- Steve

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-24 Thread Linus Torvalds
On Thu, Aug 24, 2017 at 2:13 PM, Thomas Garnier  wrote:
>
> My original performance testing was done with an Ubuntu generic
> configuration. This configuration has the CONFIG_FUNCTION_TRACER
> option which was incompatible with PIE. The tracer failed to replace
> the __fentry__ call by a nop slide on each traceable function because
> the instruction was not the one expected. If PIE is enabled, gcc
> generates a different call instruction based on the GOT without
> checking the visibility options (basically call *__fentry__@GOTPCREL).

Gah.

Don't we actually have *more* address bits for randomization at the
low end, rather than getting rid of -mcmodel=kernel?

Has anybody looked at just moving kernel text by smaller values than
the page size? Yeah, yeah, the kernel has several sections that need
page alignment, but I think we could relocate normal text by just the
cacheline size, and that sounds like it would give several bits of
randomness with little downside.
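
(Concretely: with page-aligned placement today and 64-byte cache lines,
sub-page placement would add log2(4096/64) = 6 bits of randomness - a quick
check, assuming those page and cache-line sizes:)

#include <stdio.h>

int main(void)
{
    unsigned int page = 4096, cacheline = 64;   /* assumed sizes */
    unsigned int slots = page / cacheline, bits = 0;

    while (slots > 1) {                         /* log2(4096/64) = 6 */
        slots >>= 1;
        bits++;
    }
    printf("extra randomness: %u bits\n", bits);
    return 0;
}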

Or has somebody already looked at it and I just missed it?

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-24 Thread Thomas Garnier
On Thu, Aug 17, 2017 at 7:10 AM, Thomas Garnier  wrote:
>
> On Thu, Aug 17, 2017 at 1:09 AM, Ingo Molnar  wrote:
> >
> >
> > * Thomas Garnier  wrote:
> >
> > > > > -mcmodel=small/medium assumes you are in the low 32 bits of the address
> > > > > space. It generates instructions where the high 32 bits of the virtual
> > > > > addresses are zero.
> > > >
> > > > How are these assumptions hardcoded by GCC? Most of the instructions 
> > > > should be
> > > > relocatable straight away, as most call/jump/branch instructions are
> > > > RIP-relative.
> > >
> > > I think PIE is capable to use relative instructions well. mcmodel=large 
> > > assumes
> > > symbols can be anywhere.
> >
> > So if the numbers in your changelog and Kconfig text cannot be trusted, 
> > there's
> > this description of the size impact which I suspect is less susceptible to
> > measurement error:
> >
> > + The kernel and modules will generate slightly more assembly (1 to 
> > 2%
> > + increase on the .text sections). The vmlinux binary will be
> > + significantly smaller due to less relocations.
> >
> > ... but describing a 1-2% kernel text size increase as "slightly more
> > assembly" shows a gratuitous disregard for kernel code generation quality!
> > In reality that's a huge size increase that in most cases will almost
> > directly transfer to a 1-2% slowdown for kernel-intense workloads.
> >
> >
> > Where does that size increase come from, if PIE is capable of using relative
> > instructions well? Does it come from the loss of a generic register and the
> > resulting increase in register pressure, stack spills, etc.?
>
> I will try to gather more information on the size increase. The size
> increase might be smaller with gcc 4.9 given performance was much
> better.

Coming back to this thread, as I have identified the root cause of the
performance issue.

My original performance testing was done with an Ubuntu generic
configuration. This configuration has the CONFIG_FUNCTION_TRACER
option which was incompatible with PIE. The tracer failed to replace
the __fentry__ call by a nop slide on each traceable function because
the instruction was not the one expected. If PIE is enabled, gcc
generates a different call instruction based on the GOT without
checking the visibility options (basically call *__fentry__@GOTPCREL).

With the fix for function tracing, the hackbench results show an
average overhead of +0.8% to +1.4% (down from +8% to +10% before). With a
default configuration, the numbers are closer to 0.8%.

On the .text size, with gcc 4.9 I see +0.8% with the default configuration
and +1.18% with the Ubuntu configuration.

The next iteration should have an updated set of performance metrics (I will
try to use gcc 6.0 or higher) and incorporate the fix for function
tracing.

Let me know if you have questions and feedback.

>
> >
> > So I'm still unhappy about this all, and about the attitude surrounding it.
> >
> > Thanks,
> >
> > Ingo
>
>
>
>
> --
> Thomas




-- 
Thomas



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-21 Thread Thomas Garnier
On Mon, Aug 21, 2017 at 7:31 AM, Peter Zijlstra  wrote:
> On Tue, Aug 15, 2017 at 07:20:38AM -0700, Thomas Garnier wrote:
>> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:
>
>> > Have you considered a kernel with -mcmodel=small (or medium) instead of 
>> > -fpie
>> > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) 
>> > canonical
>> > x86-64 address space to randomize the location of kernel text. The 
>> > location of
>> > modules can be further randomized within that 2GB window.
>>
>> -model=small/medium assume you are on the low 32-bit. It generates
>> instructions where the virtual addresses have the high 32-bit to be
>> zero.
>
> That's a compiler fail, right? Because the SDM states that for "CALL
> rel32" the 32bit displacement is sign extended on x86_64.
>

That's different from what I expected at first, too.

Now, I think I have an alternative to using mcmodel=large. I could use
-fPIC and ensure modules are never far away from the main kernel
(moving the start of the module section close to the randomized kernel end).
I looked at it and that seems possible, but it will require more work. I
plan to start with the mcmodel=large support and add this mode in a
way that could benefit classic KASLR (without -fPIC), because it
randomizes where modules start based on the kernel.
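
A small standalone sketch of that placement constraint (the 1 GB module-area size, example base address, and helper name are illustrative assumptions):

#include <stdio.h>

#define MODULES_LEN   (1UL << 30)     /* assumed 1 GB module area       */
#define REL32_REACH   (1UL << 31)     /* +/- 2 GB range of call rel32   */

/* Place the module area right after the randomized kernel image so that
 * 32-bit relative calls between the kernel and modules always reach. */
static unsigned long module_area_start(unsigned long kernel_end)
{
        return kernel_end;
}

int main(void)
{
        unsigned long kernel_start = 0xffffffff81000000UL;  /* example base */
        unsigned long kernel_end   = kernel_start + (64UL << 20);
        unsigned long mod_start    = module_area_start(kernel_end);
        unsigned long span         = mod_start + MODULES_LEN - kernel_start;

        printf("modules at %#lx, max distance %#lx (%s rel32 reach)\n",
               mod_start, span, span < REL32_REACH ? "within" : "outside");
        return 0;
}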

-- 
Thomas



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-21 Thread Peter Zijlstra
On Tue, Aug 15, 2017 at 07:20:38AM -0700, Thomas Garnier wrote:
> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:

> > Have you considered a kernel with -mcmodel=small (or medium) instead of 
> > -fpie
> > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) 
> > canonical
> > x86-64 address space to randomize the location of kernel text. The location 
> > of
> > modules can be further randomized within that 2GB window.
> 
> -mcmodel=small/medium assumes you are in the low 32 bits. It generates
> instructions where the high 32 bits of the virtual addresses are zero.

That's a compiler fail, right? Because the SDM states that for "CALL
rel32" the 32-bit displacement is sign-extended on x86_64.




Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-21 Thread Peter Zijlstra
On Mon, Aug 21, 2017 at 03:32:22PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 16, 2017 at 05:12:35PM +0200, Ingo Molnar wrote:
> > Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine 
> > instruction level.
> > 
> > Function calls look like this:
> > 
> >  -mcmodel=medium:
> > 
> >757:   e8 98 ff ff ff  callq  6f4 
> > 
> >  -mcmodel=large
> > 
> >77b:   48 b8 10 f7 df ff ffmovabs $0xffdff710,%rax
> >782:   ff ff ff 
> >785:   48 8d 04 03 lea(%rbx,%rax,1),%rax
> >789:   ff d0   callq  *%rax
> > 
> > And we'd do this for _EVERY_ function call in the kernel. That kind of crap 
> > is 
> > totally unacceptable.
> 
> So why does this need to be computed for every single call? How often
> will we move the kernel around at runtime?
> 
> Why can't we process the relocation at load time and then discard the
> relocation tables along with the rest of __init ?

Ah, I see, this is the large model, which needs MOVABS to load 64-bit
immediates. Still, small-model RIP-relative code should be able to live at
any point as long as everything stays inside the same 2G relative range, so
it would still allow the goal of increasing the KASLR range.

So I'm not seeing why we need the large model for that. That said, after
reading up on all this, RIP-relative code will not be too pretty either:
while CALL is naturally RIP-relative, data still needs an explicit %rip
offset. Still, it is loads better than the large model.
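
To make the data-access point concrete, a small example with the codegen one would typically expect (the instruction sequences are approximate, not verified compiler output):

/* counter.c */
static int counter;

int read_counter(void)
{
        return counter;
}

/*
 * -mcmodel=kernel / -mcmodel=small / -fpie:  one RIP-relative load
 *         mov    counter(%rip), %eax
 *
 * -mcmodel=large:  absolute 64-bit address materialized first
 *         movabs $counter, %rax
 *         mov    (%rax), %eax
 */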



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-21 Thread Peter Zijlstra
On Wed, Aug 16, 2017 at 05:12:35PM +0200, Ingo Molnar wrote:
> Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine 
> instruction level.
> 
> Function calls look like this:
> 
>  -mcmodel=medium:
> 
>757:   e8 98 ff ff ff  callq  6f4 
> 
>  -mcmodel=large
> 
>77b:   48 b8 10 f7 df ff ffmovabs $0xffdff710,%rax
>782:   ff ff ff 
>785:   48 8d 04 03 lea(%rbx,%rax,1),%rax
>789:   ff d0   callq  *%rax
> 
> And we'd do this for _EVERY_ function call in the kernel. That kind of crap 
> is 
> totally unacceptable.

So why does this need to be computed for every single call? How often
will we move the kernel around at runtime?

Why can't we process the relocation at load time and then discard the
relocation tables along with the rest of __init ?
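
A sketch of what that could look like (illustrative only: the table layout, linker symbols and fixup are assumptions, not the in-tree relocation handling):

struct kernel_rela {
        unsigned long offset;                 /* link-time address to patch */
};

/* assumed to be provided by the linker script, kept in init data */
extern struct kernel_rela __relocs_start[], __relocs_end[];

/* Run once during early boot; afterwards the table can be freed along
 * with the rest of the __init sections. */
void apply_relative_relocs(unsigned long load_delta)
{
        struct kernel_rela *r;

        for (r = __relocs_start; r < __relocs_end; r++) {
                unsigned long *where = (unsigned long *)(r->offset + load_delta);

                *where += load_delta;         /* R_X86_64_RELATIVE-style fixup */
        }
}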



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-17 Thread Thomas Garnier
On Thu, Aug 17, 2017 at 1:09 AM, Ingo Molnar  wrote:
>
>
> * Thomas Garnier  wrote:
>
> > > > -model=small/medium assume you are on the low 32-bit. It generates
> > > > instructions where the virtual addresses have the high 32-bit to be 
> > > > zero.
> > >
> > > How are these assumptions hardcoded by GCC? Most of the instructions 
> > > should be
> > > relocatable straight away, as most call/jump/branch instructions are
> > > RIP-relative.
> >
> > I think PIE is capable to use relative instructions well. mcmodel=large 
> > assumes
> > symbols can be anywhere.
>
> So if the numbers in your changelog and Kconfig text cannot be trusted, 
> there's
> this description of the size impact which I suspect is less susceptible to
> measurement error:
>
> + The kernel and modules will generate slightly more assembly (1 to 2%
> + increase on the .text sections). The vmlinux binary will be
> + significantly smaller due to less relocations.
>
> ... but describing a 1-2% kernel text size increase as "slightly more assembly"
> shows a gratuitous disregard for kernel code generation quality! In reality that's
> a huge size increase that in most cases will almost directly transfer to a 1-2%
> slowdown for kernel-intensive workloads.
>
>
> Where does that size increase come from, if PIE is capable of using relative
> instructions well? Does it come from the loss of a generic register and the
> resulting increase in register pressure, stack spills, etc.?

I will try to gather more information on the size increase. The size
increase might be smaller with gcc 4.9 given performance was much
better.

>
> So I'm still unhappy about this all, and about the attitude surrounding it.
>
> Thanks,
>
> Ingo




-- 
Thomas



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-17 Thread Ingo Molnar

* Thomas Garnier  wrote:

> > > -model=small/medium assume you are on the low 32-bit. It generates 
> > > instructions where the virtual addresses have the high 32-bit to be zero.
> >
> > How are these assumptions hardcoded by GCC? Most of the instructions should 
> > be 
> > relocatable straight away, as most call/jump/branch instructions are 
> > RIP-relative.
> 
> I think PIE is capable to use relative instructions well. mcmodel=large 
> assumes 
> symbols can be anywhere.

So if the numbers in your changelog and Kconfig text cannot be trusted, there's 
this description of the size impact which I suspect is less susceptible to 
measurement error:

+ The kernel and modules will generate slightly more assembly (1 to 2%
+ increase on the .text sections). The vmlinux binary will be
+ significantly smaller due to less relocations.

... but describing a 1-2% kernel text size increase as "slightly more assembly" 
shows a gratuitous disregard for kernel code generation quality! In reality that's 
a huge size increase that in most cases will almost directly transfer to a 1-2% 
slowdown for kernel-intensive workloads.

Where does that size increase come from, if PIE is capable of using relative 
instructions well? Does it come from the loss of a generic register and the 
resulting increase in register pressure, stack spills, etc.?

So I'm still unhappy about this all, and about the attitude surrounding it.

Thanks,

Ingo



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-16 Thread Thomas Garnier
On Wed, Aug 16, 2017 at 8:12 AM, Ingo Molnar  wrote:
>
>
> * Thomas Garnier  wrote:
>
> > On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:
> > >
> > > * Thomas Garnier  wrote:
> > >
> > >> > Do these changes get us closer to being able to build the kernel as 
> > >> > truly
> > >> > position independent, i.e. to place it anywhere in the valid x86-64 
> > >> > address
> > >> > space? Or any other advantages?
> > >>
> > >> Yes, PIE allows us to put the kernel anywhere in memory. It will allow 
> > >> us to
> > >> have a full randomized address space where position and order of 
> > >> sections are
> > >> completely random. There is still some work to get there but being able 
> > >> to build
> > >> a PIE kernel is a significant step.
> > >
> > > So I _really_ dislike the whole PIE approach, because of the huge 
> > > slowdown:
> > >
> > > +config RANDOMIZE_BASE_LARGE
> > > +   bool "Increase the randomization range of the kernel image"
> > > +   depends on X86_64 && RANDOMIZE_BASE
> > > +   select X86_PIE
> > > +   select X86_MODULE_PLTS if MODULES
> > > +   default n
> > > +   ---help---
> > > + Build the kernel as a Position Independent Executable (PIE) and
> > > + increase the available randomization range from 1GB to 3GB.
> > > +
> > > + This option impacts performance on kernel CPU intensive 
> > > workloads up
> > > + to 10% due to PIE generated code. Impact on user-mode processes 
> > > and
> > > + typical usage would be significantly less (0.50% when you build 
> > > the
> > > + kernel).
> > > +
> > > + The kernel and modules will generate slightly more assembly (1 
> > > to 2%
> > > + increase on the .text sections). The vmlinux binary will be
> > > + significantly smaller due to less relocations.
> > >
> > > To put 10% kernel overhead into perspective: enabling this option wipes 
> > > out about
> > > 5-10 years worth of painstaking optimizations we've done to keep the 
> > > kernel fast
> > > ... (!!)
> >
> > Note that 10% is the high-bound of a CPU intensive workload.
>
> Note that the 8-10% hackbench or even a 2%-4% range would be 'huge' in terms 
> of
> modern kernel performance. In many cases we are literally applying cycle level
> optimizations that are barely measurable. A 0.1% speedup in linear execution 
> speed
> is already a big success.
>
> > I am going to start doing performance testing on -mcmodel=large to see if 
> > it is
> > faster than -fPIE.
>
> Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine
> instruction level.
>
> Function calls look like this:
>
>  -mcmodel=medium:
>
>757:   e8 98 ff ff ff  callq  6f4 
>
>  -mcmodel=large
>
>77b:   48 b8 10 f7 df ff ffmovabs $0xffdff710,%rax
>782:   ff ff ff
>785:   48 8d 04 03 lea(%rbx,%rax,1),%rax
>789:   ff d0   callq  *%rax
>
> And we'd do this for _EVERY_ function call in the kernel. That kind of crap is
> totally unacceptable.
>

I started looking into mcmodel=large and ran into multiple issues. In
the meantime, I thought I would try different configurations and compilers.

I did 10 hackbench runs across 10 reboots with and without PIE (same
commit), with gcc 4.9. I copied the results below; based on the hackbench
configuration, we are between -0.29% and 1.92% (the average across runs is
0.8%), which seems more in line with what people discussed in this thread.

I don't know how I got a 10% maximum on hackbench; I am still
investigating. It could be the configuration I used, or my base compiler
being too old.

> > > I think the fundamental flaw is the assumption that we need a PIE 
> > > executable
> > > to have a freely relocatable kernel on 64-bit CPUs.
> > >
> > > Have you considered a kernel with -mcmodel=small (or medium) instead of 
> > > -fpie
> > > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) 
> > > canonical
> > > x86-64 address space to randomize the location of kernel text. The 
> > > location of
> > > modules can be further randomized within that 2GB window.
> >
> > -model=small/medium assume you are on the low 32-bit. It generates 
> > instructions
> > where the virtual addresses have the high 32-bit to be zero.
>
> How are these assumptions hardcoded by GCC? Most of the instructions should be
> relocatable straight away, as most call/jump/branch instructions are 
> RIP-relative.

I think PIE is capable of using relative instructions well;
mcmodel=large assumes symbols can be anywhere.

>
> I.e. is there no GCC code generation mode where code can be placed anywhere 
> in the
> canonical address space, yet call and jump distance is within 31 bits so that 
> the
> generated code is fast?

I think that's basically PIE. With PIE, you have the assumption that
everything is close; the main issue is any assembly referencing
absolute addresses.
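
As an example of the kind of assembly that needs attention, an address-of-symbol sequence (the symbol and function are made up for illustration):

unsigned long some_symbol;      /* stand-in for a kernel symbol */

unsigned long some_symbol_address(void)
{
        unsigned long addr;

        /*
         * mcmodel=kernel style, absolute address (R_X86_64_32S),
         * breaks once the kernel can be placed anywhere:
         *
         *         movq $some_symbol, %rax
         *
         * PIE-compatible, RIP-relative replacement:
         */
        asm ("leaq some_symbol(%%rip), %0" : "=r" (addr));

        return addr;
}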


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-16 Thread Ard Biesheuvel
On 16 August 2017 at 17:26, Daniel Micay  wrote:
>> How are these assumptions hardcoded by GCC? Most of the instructions
>> should be
>> relocatable straight away, as most call/jump/branch instructions are
>> RIP-relative.
>>
>> I.e. is there no GCC code generation mode where code can be placed
>> anywhere in the
>> canonical address space, yet call and jump distance is within 31 bits
>> so that the
>> generated code is fast?
>
> That's what PIE is meant to do. However, not disabling support for lazy
> linking (-fno-plt) / symbol interposition (-Bsymbolic) is going to cause
> it to add needless overhead.
>
> arm64 is using -pie -shared -Bsymbolic in arch/arm64/Makefile for their
> CONFIG_RELOCATABLE option. See 08cc55b2afd97a654f71b3bebf8bb0ec89fdc498.

The difference with arm64 is that its generic small code model is
already position independent, so we don't have to pass -fpic or -fpie
to the compiler. We only link in PIE mode to get the linker to emit
the dynamic relocation tables into the ELF binary. Relative branches
have a range of +/- 128 MB, which covers the kernel and modules
(unless the option to randomize the module region independently has
been selected, in which case branches between the kernel and modules
may be resolved via PLT entries that are emitted at module load time).

I am not sure how this extrapolates to x86, just adding some context.



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-16 Thread Daniel Micay
> How are these assumptions hardcoded by GCC? Most of the instructions
> should be 
> relocatable straight away, as most call/jump/branch instructions are
> RIP-relative.
> 
> I.e. is there no GCC code generation mode where code can be placed
> anywhere in the 
> canonical address space, yet call and jump distance is within 31 bits
> so that the 
> generated code is fast?

That's what PIE is meant to do. However, not disabling lazy PLT-based
linking (with -fno-plt) and symbol interposition (with -Bsymbolic) is
going to add needless overhead.

arm64 is using -pie -shared -Bsymbolic in arch/arm64/Makefile for their
CONFIG_RELOCATABLE option. See 08cc55b2afd97a654f71b3bebf8bb0ec89fdc498.
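
To illustrate what those flags change on x86-64 (typical codegen shown as a sketch, not verified output):

/* caller.c */
extern void helper(void);

void caller(void)
{
        helper();
}

/*
 * gcc -O2 -fpie:                  call helper@PLT
 *   (lazy-bindable call through the PLT)
 *
 * gcc -O2 -fpie -fno-plt:         call *helper@GOTPCREL(%rip)
 *   (indirect call through the GOT, no PLT stub)
 *
 * When the final link binds the symbol locally (-Bsymbolic, hidden
 * visibility, or a symbol defined in the same image), a modern linker
 * can relax the GOT form back to a direct "call helper".
 */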



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-16 Thread Christopher Lameter
On Wed, 16 Aug 2017, Ingo Molnar wrote:

> And we'd do this for _EVERY_ function call in the kernel. That kind of crap is
> totally unacceptable.

Ahh, finally a limit is in sight as to how much security hardening etc. can
reduce kernel performance.




Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-16 Thread Ingo Molnar

* Thomas Garnier  wrote:

> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:
> >
> > * Thomas Garnier  wrote:
> >
> >> > Do these changes get us closer to being able to build the kernel as truly
> >> > position independent, i.e. to place it anywhere in the valid x86-64 
> >> > address
> >> > space? Or any other advantages?
> >>
> >> Yes, PIE allows us to put the kernel anywhere in memory. It will allow us 
> >> to
> >> have a full randomized address space where position and order of sections 
> >> are
> >> completely random. There is still some work to get there but being able to 
> >> build
> >> a PIE kernel is a significant step.
> >
> > So I _really_ dislike the whole PIE approach, because of the huge slowdown:
> >
> > +config RANDOMIZE_BASE_LARGE
> > +   bool "Increase the randomization range of the kernel image"
> > +   depends on X86_64 && RANDOMIZE_BASE
> > +   select X86_PIE
> > +   select X86_MODULE_PLTS if MODULES
> > +   default n
> > +   ---help---
> > + Build the kernel as a Position Independent Executable (PIE) and
> > + increase the available randomization range from 1GB to 3GB.
> > +
> > + This option impacts performance on kernel CPU intensive workloads 
> > up
> > + to 10% due to PIE generated code. Impact on user-mode processes 
> > and
> > + typical usage would be significantly less (0.50% when you build 
> > the
> > + kernel).
> > +
> > + The kernel and modules will generate slightly more assembly (1 to 
> > 2%
> > + increase on the .text sections). The vmlinux binary will be
> > + significantly smaller due to less relocations.
> >
> > To put 10% kernel overhead into perspective: enabling this option wipes out 
> > about
> > 5-10 years worth of painstaking optimizations we've done to keep the kernel 
> > fast
> > ... (!!)
> 
> Note that 10% is the high-bound of a CPU intensive workload.

Note that an 8-10% hackbench slowdown, or even a 2%-4% range, would be 'huge' in terms of 
modern kernel performance. In many cases we are literally applying cycle-level 
optimizations that are barely measurable. A 0.1% speedup in linear execution speed 
is already a big success.

> I am going to start doing performance testing on -mcmodel=large to see if it 
> is 
> faster than -fPIE.

Unfortunately mcmodel=large looks pretty heavy too AFAICS, at the machine 
instruction level.

Function calls look like this:

 -mcmodel=medium:

   757:   e8 98 ff ff ff  callq  6f4 

 -mcmodel=large

   77b:   48 b8 10 f7 df ff ffmovabs $0xffdff710,%rax
   782:   ff ff ff 
   785:   48 8d 04 03 lea(%rbx,%rax,1),%rax
   789:   ff d0   callq  *%rax

And we'd do this for _EVERY_ function call in the kernel. That kind of crap is 
totally unacceptable.

> > I think the fundamental flaw is the assumption that we need a PIE 
> > executable 
> > to have a freely relocatable kernel on 64-bit CPUs.
> >
> > Have you considered a kernel with -mcmodel=small (or medium) instead of 
> > -fpie 
> > -mcmodel=large? We can pick a random 2GB window in the (non-kernel) 
> > canonical 
> > x86-64 address space to randomize the location of kernel text. The location 
> > of 
> > modules can be further randomized within that 2GB window.
> 
> -mcmodel=small/medium assumes you are in the low 32 bits. It generates 
> instructions where the high 32 bits of the virtual addresses are zero.

How are these assumptions hardcoded by GCC? Most of the instructions should be 
relocatable straight away, as most call/jump/branch instructions are 
RIP-relative.

I.e. is there no GCC code generation mode where code can be placed anywhere in 
the 
canonical address space, yet call and jump distance is within 31 bits so that 
the 
generated code is fast?

Thanks,

Ingo



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-15 Thread Thomas Garnier
On Tue, Aug 15, 2017 at 7:47 AM, Daniel Micay  wrote:
> On 15 August 2017 at 10:20, Thomas Garnier  wrote:
>> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:
>>>
>>> * Thomas Garnier  wrote:
>>>
 > Do these changes get us closer to being able to build the kernel as truly
 > position independent, i.e. to place it anywhere in the valid x86-64 
 > address
 > space? Or any other advantages?

 Yes, PIE allows us to put the kernel anywhere in memory. It will allow us 
 to
 have a full randomized address space where position and order of sections 
 are
 completely random. There is still some work to get there but being able to 
 build
 a PIE kernel is a significant step.
>>>
>>> So I _really_ dislike the whole PIE approach, because of the huge slowdown:
>>>
>>> +config RANDOMIZE_BASE_LARGE
>>> +   bool "Increase the randomization range of the kernel image"
>>> +   depends on X86_64 && RANDOMIZE_BASE
>>> +   select X86_PIE
>>> +   select X86_MODULE_PLTS if MODULES
>>> +   default n
>>> +   ---help---
>>> + Build the kernel as a Position Independent Executable (PIE) and
>>> + increase the available randomization range from 1GB to 3GB.
>>> +
>>> + This option impacts performance on kernel CPU intensive workloads 
>>> up
>>> + to 10% due to PIE generated code. Impact on user-mode processes 
>>> and
>>> + typical usage would be significantly less (0.50% when you build 
>>> the
>>> + kernel).
>>> +
>>> + The kernel and modules will generate slightly more assembly (1 to 
>>> 2%
>>> + increase on the .text sections). The vmlinux binary will be
>>> + significantly smaller due to less relocations.
>>>
>>> To put 10% kernel overhead into perspective: enabling this option wipes out 
>>> about
>>> 5-10 years worth of painstaking optimizations we've done to keep the kernel 
>>> fast
>>> ... (!!)
>>
>> Note that 10% is the high-bound of a CPU intensive workload.
>
> The cost can be reduced by using -fno-plt these days but some work
> might be required to make that work with the kernel.
>
> Where does that 10% estimate in the kernel config docs come from? I'd
> be surprised if it really cost that much on x86_64. That's a realistic
> cost for i386 with modern GCC (it used to be worse) but I'd expect
> x86_64 to be closer to 2% even for CPU intensive workloads. It should
> be very close to zero with -fno-plt.

I got 8 to 10% on hackbench. Other benchmarks were 4% or lower.

I will also look at a more recent compiler and -fno-plt.

-- 
Thomas



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-15 Thread Daniel Micay
On 15 August 2017 at 10:20, Thomas Garnier  wrote:
> On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:
>>
>> * Thomas Garnier  wrote:
>>
>>> > Do these changes get us closer to being able to build the kernel as truly
>>> > position independent, i.e. to place it anywhere in the valid x86-64 
>>> > address
>>> > space? Or any other advantages?
>>>
>>> Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to
>>> have a full randomized address space where position and order of sections 
>>> are
>>> completely random. There is still some work to get there but being able to 
>>> build
>>> a PIE kernel is a significant step.
>>
>> So I _really_ dislike the whole PIE approach, because of the huge slowdown:
>>
>> +config RANDOMIZE_BASE_LARGE
>> +   bool "Increase the randomization range of the kernel image"
>> +   depends on X86_64 && RANDOMIZE_BASE
>> +   select X86_PIE
>> +   select X86_MODULE_PLTS if MODULES
>> +   default n
>> +   ---help---
>> + Build the kernel as a Position Independent Executable (PIE) and
>> + increase the available randomization range from 1GB to 3GB.
>> +
>> + This option impacts performance on kernel CPU intensive workloads 
>> up
>> + to 10% due to PIE generated code. Impact on user-mode processes and
>> + typical usage would be significantly less (0.50% when you build the
>> + kernel).
>> +
>> + The kernel and modules will generate slightly more assembly (1 to 
>> 2%
>> + increase on the .text sections). The vmlinux binary will be
>> + significantly smaller due to less relocations.
>>
>> To put 10% kernel overhead into perspective: enabling this option wipes out 
>> about
>> 5-10 years worth of painstaking optimizations we've done to keep the kernel 
>> fast
>> ... (!!)
>
> Note that 10% is the high-bound of a CPU intensive workload.

The cost can be reduced by using -fno-plt these days but some work
might be required to make that work with the kernel.

Where does that 10% estimate in the kernel config docs come from? I'd
be surprised if it really cost that much on x86_64. That's a realistic
cost for i386 with modern GCC (it used to be worse) but I'd expect
x86_64 to be closer to 2% even for CPU intensive workloads. It should
be very close to zero with -fno-plt.



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-15 Thread Thomas Garnier
On Tue, Aug 15, 2017 at 12:56 AM, Ingo Molnar  wrote:
>
> * Thomas Garnier  wrote:
>
>> > Do these changes get us closer to being able to build the kernel as truly
>> > position independent, i.e. to place it anywhere in the valid x86-64 address
>> > space? Or any other advantages?
>>
>> Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to
>> have a full randomized address space where position and order of sections are
>> completely random. There is still some work to get there but being able to 
>> build
>> a PIE kernel is a significant step.
>
> So I _really_ dislike the whole PIE approach, because of the huge slowdown:
>
> +config RANDOMIZE_BASE_LARGE
> +   bool "Increase the randomization range of the kernel image"
> +   depends on X86_64 && RANDOMIZE_BASE
> +   select X86_PIE
> +   select X86_MODULE_PLTS if MODULES
> +   default n
> +   ---help---
> + Build the kernel as a Position Independent Executable (PIE) and
> + increase the available randomization range from 1GB to 3GB.
> +
> + This option impacts performance on kernel CPU intensive workloads up
> + to 10% due to PIE generated code. Impact on user-mode processes and
> + typical usage would be significantly less (0.50% when you build the
> + kernel).
> +
> + The kernel and modules will generate slightly more assembly (1 to 2%
> + increase on the .text sections). The vmlinux binary will be
> + significantly smaller due to less relocations.
>
> To put 10% kernel overhead into perspective: enabling this option wipes out 
> about
> 5-10 years worth of painstaking optimizations we've done to keep the kernel 
> fast
> ... (!!)

Note that 10% is the high-bound of a CPU intensive workload.

>
> I think the fundamental flaw is the assumption that we need a PIE executable 
> to
> have a freely relocatable kernel on 64-bit CPUs.
>
> Have you considered a kernel with -mcmodel=small (or medium) instead of -fpie
> -mcmodel=large? We can pick a random 2GB window in the (non-kernel) canonical
> x86-64 address space to randomize the location of kernel text. The location of
> modules can be further randomized within that 2GB window.

-mcmodel=small/medium assumes you are in the low 32 bits. It generates
instructions where the high 32 bits of the virtual addresses are zero.

I am going to start doing performance testing on -mcmodel=large to see
if it is faster than -fPIE.

>
> It should have far less performance impact than the register-losing and
> overhead-inducing -fpie / -mcmodel=large (for modules) execution models.
>
> My quick guess is that the performance impact might be close to zero in fact.

If mcmodel=small/medium were possible for the kernel, I don't think it
would have less performance impact than mcmodel=large. It would still
need to set the high 32 bits to a static value; only the relocation
would be a different size.

>
> Thanks,
>
> Ingo



-- 
Thomas



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-15 Thread Ingo Molnar

* Thomas Garnier  wrote:

> > Do these changes get us closer to being able to build the kernel as truly 
> > position independent, i.e. to place it anywhere in the valid x86-64 address 
> > space? Or any other advantages?
> 
> Yes, PIE allows us to put the kernel anywhere in memory. It will allow us to 
> have a full randomized address space where position and order of sections are 
> completely random. There is still some work to get there but being able to 
> build 
> a PIE kernel is a significant step.

So I _really_ dislike the whole PIE approach, because of the huge slowdown:

+config RANDOMIZE_BASE_LARGE
+   bool "Increase the randomization range of the kernel image"
+   depends on X86_64 && RANDOMIZE_BASE
+   select X86_PIE
+   select X86_MODULE_PLTS if MODULES
+   default n
+   ---help---
+ Build the kernel as a Position Independent Executable (PIE) and
+ increase the available randomization range from 1GB to 3GB.
+
+ This option impacts performance on kernel CPU intensive workloads up
+ to 10% due to PIE generated code. Impact on user-mode processes and
+ typical usage would be significantly less (0.50% when you build the
+ kernel).
+
+ The kernel and modules will generate slightly more assembly (1 to 2%
+ increase on the .text sections). The vmlinux binary will be
+ significantly smaller due to less relocations.

To put 10% kernel overhead into perspective: enabling this option wipes out 
about 
5-10 years worth of painstaking optimizations we've done to keep the kernel 
fast 
... (!!)

I think the fundamental flaw is the assumption that we need a PIE executable to 
have a freely relocatable kernel on 64-bit CPUs.

Have you considered a kernel with -mcmodel=small (or medium) instead of -fpie 
-mcmodel=large? We can pick a random 2GB window in the (non-kernel) canonical 
x86-64 address space to randomize the location of kernel text. The location of 
modules can be further randomized within that 2GB window.
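
A userspace sketch of that layout, just to make the arithmetic concrete (the range, alignment and entropy source are placeholders, not a proposed implementation):

#include <stdio.h>
#include <stdlib.h>

#define GB            (1UL << 30)
#define WINDOW_SIZE   (2 * GB)
#define SLOT          (2UL << 20)             /* keep 2 MB alignment for text */

int main(void)
{
        srandom(42);                          /* placeholder entropy source */

        /* 1. pick a 2 GB-aligned window somewhere in the kernel half
         *    of the canonical address space (128 TB with 4-level paging) */
        unsigned long nr_windows = (1UL << 47) / WINDOW_SIZE;
        unsigned long window = 0xffff800000000000UL +
                               ((unsigned long)random() % nr_windows) * WINDOW_SIZE;

        /* 2. randomize kernel text within the window, 2 MB aligned */
        unsigned long kernel = window + ((unsigned long)random() % (GB / SLOT)) * SLOT;

        /* 3. keep modules in the same window so rel32 calls still reach */
        unsigned long modules = window + WINDOW_SIZE - GB / 2;

        printf("window  %#lx\nkernel  %#lx\nmodules %#lx\n", window, kernel, modules);
        return 0;
}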

It should have far less performance impact than the register-losing and 
overhead-inducing -fpie / -mcmodel=large (for modules) execution models.

My quick guess is that the performance impact might be close to zero in fact.

Thanks,

Ingo



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-11 Thread Thomas Garnier
On Fri, Aug 11, 2017 at 5:41 AM, Ingo Molnar  wrote:
>
> * Thomas Garnier  wrote:
>
>> Changes:
>>  - v2:
>>- Add support for global stack cookie while compiler default to fs without
>>  mcmodel=kernel
>>- Change patch 7 to correctly jump out of the identity mapping on kexec 
>> load
>>  preserve.
>>
>> These patches make the changes necessary to build the kernel as Position
>> Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
>> the top 2G of the virtual address space. It allows to optionally extend the
>> KASLR randomization range from 1G to 3G.
>
> So this:
>
>  61 files changed, 923 insertions(+), 299 deletions(-)
>
> ... is IMHO an _awful_ lot of churn and extra complexity in pretty fragile 
> pieces
> of code, to gain what appears to be only ~1.5 more bits of randomization!

The range increase is a way to use PIE right away.

>
> Do these changes get us closer to being able to build the kernel as truly 
> position
> independent, i.e. to place it anywhere in the valid x86-64 address space? Or 
> any
> other advantages?

Yes, PIE allows us to put the kernel anywhere in memory. It will allow
us to have a fully randomized address space where the position and order
of sections are completely random. There is still some work to get there
but being able to build a PIE kernel is a significant step.

>
> Thanks,
>
> Ingo



-- 
Thomas



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-11 Thread Ingo Molnar

* Thomas Garnier  wrote:

> Changes:
>  - v2:
>- Add support for global stack cookie while compiler default to fs without
>  mcmodel=kernel
>- Change patch 7 to correctly jump out of the identity mapping on kexec 
> load
>  preserve.
> 
> These patches make the changes necessary to build the kernel as Position
> Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
> the top 2G of the virtual address space. It allows to optionally extend the
> KASLR randomization range from 1G to 3G.

So this:

 61 files changed, 923 insertions(+), 299 deletions(-)

... is IMHO an _awful_ lot of churn and extra complexity in pretty fragile 
pieces 
of code, to gain what appears to be only ~1.5 more bits of randomization!

Do these changes get us closer to being able to build the kernel as truly 
position 
independent, i.e. to place it anywhere in the valid x86-64 address space? Or 
any 
other advantages?

Thanks,

Ingo



Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-07-19 Thread Christopher Lameter
On Tue, 18 Jul 2017, Thomas Garnier wrote:

> Performance/Size impact:
> Hackbench (50% and 1600% loads):
>  - PIE enabled: 7% to 8% on half load, 10% on heavy load.
> slab_test (average of 10 runs):
>  - PIE enabled: 3% to 4%
> Kernbench (average of 10 Half and Optimal runs):
>  - PIE enabled: 5% to 6%
>
> Size of vmlinux (Ubuntu configuration):
>  File size:
>  - PIE disabled: 472928672 bytes (-0.000169% from baseline)
>  - PIE enabled: 216878461 bytes (-54.14% from baseline)

Maybe we need something like CONFIG_PARANOIA so that we can determine at
build time how much performance we want to sacrifice for security?

It's going to be difficult to understand what all these hardening config
options do.

