Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-21 Thread Andy Lutomirski
On Wed, Oct 21, 2015 at 12:11 PM, Dave Hansen  wrote:
> On 10/21/2015 11:55 AM, Andy Lutomirski wrote:
>> On Fri, Oct 16, 2015 at 8:12 AM, Dave Hansen  wrote:
>>> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>>>>  - Along similar considerations, also add a sys_pkey_query() system call to query
>>>>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>>>>    at the moment.) This too could be vDSO accelerated in the future.
>>>>
>>>> I.e. something like:
>>>>
>>>>  unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>>>>  unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>>>>  unsigned long sys_pkey_get   (int pkey)
>>>>  unsigned long sys_pkey_free  (int pkey)
>>>
>>> The pkey_set() operation is going to get a wee bit interesting with signals.
>>>
>>> pkey_set() will modify the _current_ context's PKRU which includes the
>>> register itself and the kernel XSAVE buffer (if active).  But, since the
>>> PKRU state is saved/restored with the XSAVE state, we will blow away any
>>> state set during the signal.
>>>
>>> I _think_ the right move here is to either keep a 'shadow' version of
>>> PKRU inside the kernel (for each thread) and always update the task's
>>> XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
>>> signal's PKRU state in to the main process's PKRU state when returning
>>> from a signal.
>>
>> Ick.  Or we could just declare that signals don't affect the PKRU
>> state by default and mask it off in sigreturn.
>
> Yeah, I've been messing with it in a few forms and it's pretty ugly.
>
> I think it will be easier if we say the PKRU rights are not inherited by
> signals and changes during a signal are tossed out.  Signal handlers are
> special anyway and folks have to be careful writing them.

This is somewhat related to something I've been pondering in a
different context: fsbase and gsbase.

If a program changes fsbase using wrfsbase, should a signal handler
override it?  And should a change made in a signal handler carry over
after sigreturn?  Arguably, for fsbase and gsbase, the answer is no --
anyone who uses them for userspace threading (which is presumably why
they happened in the first place, even though userspace threading has
possibly dubious value) probably wants their context switches to stick
across signal invocations.

So I think that propagating PKRU into the signal handler and keeping
the in-register value on sigreturn by default is probably a reasonable
choice.

(OTOH, there's an argument for allowing programs to reset PKRU on
signal delivery: you could sort of arrange for signal handlers to be
more privileged than the code that invokes them.  But that's doable
with some asm regardless.)

>
>> In fact, maybe we should add a general xfeature (or whatever it's
>> called these days) to the xstate in the signal context that controls
>> which pieces are restored.  Then user code can tweak it if needed in
>> signal handlers.
>
> Yeah, that's probably a good idea long-term.  We're only getting more
> and more things managed by XSAVE and it's going to be increasingly
> interesting to glue real semantics back on top.
>

Should we maybe extend copy_user_to_fpregs_zeroing to have a pair of
masks, where one mask indicates which features are copied and another
indicates which are preserved?  It looks like we already allow some
control over which bits are restored from sigcontext versus being
restored to their init state.

We might need to add some kind of extended ucontext area for this.  I
don't know if we're starting to run out of space.
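
To make the two-mask idea concrete, here is a rough user-space model,
not kernel code.  Everything in it is an assumption for illustration:
toy_restore(), struct toy_xstate and the FEATURE_* names are made up,
one 64-bit word stands in for each state component, and "init state" is
simply zero.

```c
#include <assert.h>
#include <stdint.h>

#define NR_FEATURES  10          /* toy model: components 0..9 */
#define FEATURE_FP   (1u << 0)
#define FEATURE_SSE  (1u << 1)
#define FEATURE_PKRU (1u << 9)   /* PKRU is XSAVE state component 9 */

/* One 64-bit word stands in for each state component. */
struct toy_xstate {
	uint64_t comp[NR_FEATURES];
};

/*
 * Hypothetical sigreturn restore:
 *   - features in copy_mask are taken from the signal frame,
 *   - features in preserve_mask keep the live (in-register) value,
 *   - everything else goes back to its init state (zero here).
 * The two masks must not overlap.
 */
static void toy_restore(struct toy_xstate *live,
			const struct toy_xstate *frame,
			uint32_t copy_mask, uint32_t preserve_mask)
{
	assert((copy_mask & preserve_mask) == 0);

	for (int i = 0; i < NR_FEATURES; i++) {
		uint32_t bit = 1u << i;

		if (copy_mask & bit)
			live->comp[i] = frame->comp[i];
		else if (!(preserve_mask & bit))
			live->comp[i] = 0;
		/* else: preserved, leave the live value alone */
	}
}
```

With FEATURE_PKRU in preserve_mask, whatever PKRU value the handler
left in the register survives sigreturn, which is the default argued
for above.  Moving it to copy_mask gives the "changes during a signal
are tossed out" behavior instead.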

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-21 Thread Dave Hansen
On 10/21/2015 11:55 AM, Andy Lutomirski wrote:
> On Fri, Oct 16, 2015 at 8:12 AM, Dave Hansen  wrote:
>> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>>>  - Along similar considerations, also add a sys_pkey_query() system call to query
>>>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>>>    at the moment.) This too could be vDSO accelerated in the future.
>>>
>>> I.e. something like:
>>>
>>>  unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>>>  unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>>>  unsigned long sys_pkey_get   (int pkey)
>>>  unsigned long sys_pkey_free  (int pkey)
>>
>> The pkey_set() operation is going to get a wee bit interesting with signals.
>>
>> pkey_set() will modify the _current_ context's PKRU which includes the
>> register itself and the kernel XSAVE buffer (if active).  But, since the
>> PKRU state is saved/restored with the XSAVE state, we will blow away any
>> state set during the signal.
>>
>> I _think_ the right move here is to either keep a 'shadow' version of
>> PKRU inside the kernel (for each thread) and always update the task's
>> XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
>> signal's PKRU state in to the main process's PKRU state when returning
>> from a signal.
> 
> Ick.  Or we could just declare that signals don't affect the PKRU
> state by default and mask it off in sigreturn.

Yeah, I've been messing with it in a few forms and it's pretty ugly.

I think it will be easier if we say the PKRU rights are not inherited by
signals and changes during a signal are tossed out.  Signal handlers are
special anyway and folks have to be careful writing them.

> In fact, maybe we should add a general xfeature (or whatever it's
> called these days) to the xstate in the signal context that controls
> which pieces are restored.  Then user code can tweak it if needed in
> signal handlers.

Yeah, that's probably a good idea long-term.  We're only getting more
and more things managed by XSAVE and it's going to be increasingly
interesting to glue real semantics back on top.



Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-21 Thread Andy Lutomirski
On Fri, Oct 16, 2015 at 8:12 AM, Dave Hansen  wrote:
> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>>  - Along similar considerations, also add a sys_pkey_query() system call to query
>>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>>    at the moment.) This too could be vDSO accelerated in the future.
>>
>> I.e. something like:
>>
>>  unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>>  unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>>  unsigned long sys_pkey_get   (int pkey)
>>  unsigned long sys_pkey_free  (int pkey)
>
> The pkey_set() operation is going to get a wee bit interesting with signals.
>
> pkey_set() will modify the _current_ context's PKRU which includes the
> register itself and the kernel XSAVE buffer (if active).  But, since the
> PKRU state is saved/restored with the XSAVE state, we will blow away any
> state set during the signal.
>
> I _think_ the right move here is to either keep a 'shadow' version of
> PKRU inside the kernel (for each thread) and always update the task's
> XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
> signal's PKRU state in to the main process's PKRU state when returning
> from a signal.

Ick.  Or we could just declare that signals don't affect the PKRU
state by default and mask it off in sigreturn.

In fact, maybe we should add a general xfeature (or whatever it's
called these days) to the xstate in the signal context that controls
which pieces are restored.  Then user code can tweak it if needed in
signal handlers.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-16 Thread Dave Hansen
On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>  - Along similar considerations, also add a sys_pkey_query() system call to query
>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>    at the moment.) This too could be vDSO accelerated in the future.
>
> I.e. something like:
>
>  unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>  unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>  unsigned long sys_pkey_get   (int pkey)
>  unsigned long sys_pkey_free  (int pkey)

The pkey_set() operation is going to get a wee bit interesting with signals.

pkey_set() will modify the _current_ context's PKRU which includes the
register itself and the kernel XSAVE buffer (if active).  But, since the
PKRU state is saved/restored with the XSAVE state, we will blow away any
state set during the signal.

I _think_ the right move here is to either keep a 'shadow' version of
PKRU inside the kernel (for each thread) and always update the task's
XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
signal's PKRU state in to the main process's PKRU state when returning
from a signal.
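
A toy model of the problem and of the 'shadow' option, in plain
user-space C.  None of these names exist in the kernel; struct toy_task,
struct toy_sigframe and the toy_*() helpers are invented for
illustration, and PKRU is the only state being tracked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_task {
	uint32_t pkru;        /* live register value */
	uint32_t shadow_pkru; /* hypothetical per-thread kernel copy */
};

struct toy_sigframe {
	uint32_t saved_pkru;  /* PKRU as captured in the XSAVE area */
};

/* Signal delivery saves the interrupted context's PKRU in the frame. */
static void toy_deliver_signal(const struct toy_task *t, struct toy_sigframe *f)
{
	assert(t != NULL && f != NULL);
	f->saved_pkru = t->pkru;
}

/* pkey_set() updates the live register and the shadow copy. */
static void toy_pkey_set(struct toy_task *t, uint32_t new_pkru)
{
	t->pkru = new_pkru;
	t->shadow_pkru = new_pkru;
}

/* Plain XSAVE-style sigreturn: the frame wins, handler changes are lost. */
static void toy_sigreturn_plain(struct toy_task *t, const struct toy_sigframe *f)
{
	t->pkru = f->saved_pkru;
}

/*
 * Shadow variant: the rest of the frame would be restored as usual,
 * but PKRU comes from the kernel's shadow copy, not from the frame.
 */
static void toy_sigreturn_shadow(struct toy_task *t, const struct toy_sigframe *f)
{
	(void)f;
	t->pkru = t->shadow_pkru;
}
```

Calling toy_pkey_set() between toy_deliver_signal() and
toy_sigreturn_plain() leaves the task with the pre-signal PKRU, which
is the "blow away" case.  With toy_sigreturn_shadow() the value set in
the handler sticks.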


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-07 Thread Dave Hansen
On 10/07/2015 01:39 PM, Andy Lutomirski wrote:
> On Wed, Oct 7, 2015 at 1:24 PM, Dave Hansen  wrote:
>> On 10/03/2015 01:17 AM, Ingo Molnar wrote:
>>> Right now the native x86 PTE format allows two protection related bits for
>>> user-space pages:
>>>
>>>   _PAGE_BIT_RW:   if 0 the page is read-only,  if 1 then it's read-write
>>>   _PAGE_BIT_NX:   if 0 the page is executable, if 1 then it's not executable
>>>
>>> As discussed previously, pkeys allows 'true execute only (--x)' mappings.
>>>
>>> Another possibility would be 'true write-only (-w-)' mappings.
>>
>> How would those work?
>>
>> Protection Keys has a Write-Disable and an Access-Disable bit.  But,
>> Access-Disable denies _all_ data access to the region.  There's no way
>> to allow only writes.
> 
> Weird.  I wonder why Intel did that.
> 
> I also wonder whether EPT can do write-only.

The SDM makes it look that way.  There appear to be completely separate
r/w/x bits.  r=0/w=0/x=0 means !present.

The bit 0 definition says, for instance:

    Read access; indicates whether reads are allowed from the
    4-KByte page referenced by this entry
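
A small sketch of how those three bits decode, assuming the usual
layout of read in bit 0, write in bit 1 and execute in bit 2.  The
EPT_* macros and helper names are made up here.  One caveat I would
double-check in the SDM: I believe its list of EPT misconfigurations
includes entries with write set and read clear, in which case the
write-only encoding exists in the bit layout but cannot actually be
used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low three bits of an EPT entry (names are illustrative). */
#define EPT_READ  (1ull << 0)
#define EPT_WRITE (1ull << 1)
#define EPT_EXEC  (1ull << 2)

/* r=0/w=0/x=0 means the entry is not present. */
static bool ept_present(uint64_t entry)
{
	return (entry & (EPT_READ | EPT_WRITE | EPT_EXEC)) != 0;
}

/* True for "write-only" style encodings: writable but not readable. */
static bool ept_write_without_read(uint64_t entry)
{
	return (entry & EPT_WRITE) && !(entry & EPT_READ);
}

/* Spell out the interesting cases. */
static void ept_selfcheck(void)
{
	assert(!ept_present(0));
	assert(ept_write_without_read(EPT_WRITE));
	assert(!ept_write_without_read(EPT_READ | EPT_WRITE));
}
```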



Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-07 Thread Andy Lutomirski
On Wed, Oct 7, 2015 at 1:24 PM, Dave Hansen  wrote:
> On 10/03/2015 01:17 AM, Ingo Molnar wrote:
>> Right now the native x86 PTE format allows two protection related bits for
>> user-space pages:
>>
>>   _PAGE_BIT_RW:   if 0 the page is read-only,  if 1 then it's read-write
>>   _PAGE_BIT_NX:   if 0 the page is executable, if 1 then it's not executable
>>
>> As discussed previously, pkeys allows 'true execute only (--x)' mappings.
>>
>> Another possibility would be 'true write-only (-w-)' mappings.
>
> How would those work?
>
> Protection Keys has a Write-Disable and an Access-Disable bit.  But,
> Access-Disable denies _all_ data access to the region.  There's no way
> to allow only writes.

Weird.  I wonder why Intel did that.

I also wonder whether EPT can do write-only.

--Andy


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-07 Thread Dave Hansen
On 10/03/2015 01:17 AM, Ingo Molnar wrote:
> Right now the native x86 PTE format allows two protection related bits for 
> user-space pages:
> 
>   _PAGE_BIT_RW:   if 0 the page is read-only,  if 1 then it's read-write
>   _PAGE_BIT_NX:   if 0 the page is executable, if 1 then it's not executable
> 
> As discussed previously, pkeys allows 'true execute only (--x)' mappings.
> 
> Another possibility would be 'true write-only (-w-)' mappings.

How would those work?

Protection Keys has a Write-Disable and an Access-Disable bit.  But,
Access-Disable denies _all_ data access to the region.  There's no way
to allow only writes.

Or am I missing something?
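
For reference, a minimal decode of the two PKRU bits per key, assuming
the layout where key i owns bits 2i (Access-Disable) and 2i+1
(Write-Disable).  The macro and function names are invented for this
sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PKRU_AD 0x1u /* Access-Disable: blocks all data access */
#define PKRU_WD 0x2u /* Write-Disable: blocks writes only */

/* Extract the two rights bits for one of the 16 keys. */
static uint32_t pkru_bits(uint32_t pkru, int pkey)
{
	assert(pkey >= 0 && pkey < 16);
	return (pkru >> (2 * pkey)) & 0x3u;
}

/* Reads are allowed unless Access-Disable is set. */
static bool pkru_can_read(uint32_t pkru, int pkey)
{
	return !(pkru_bits(pkru, pkey) & PKRU_AD);
}

/* Writes need both Access-Disable and Write-Disable to be clear. */
static bool pkru_can_write(uint32_t pkru, int pkey)
{
	return !(pkru_bits(pkru, pkey) & (PKRU_AD | PKRU_WD));
}
```

Running through all four bit combinations for a key, there is none
where pkru_can_write() is true and pkru_can_read() is false, which is
the "no way to allow only writes" point.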


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-07 Thread Ingo Molnar

* Dave Hansen  wrote:

> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
> >  - I'd also suggest providing an initial value with the 'alloc' call. It's true
> >    that user-space can do this itself in assembly, OTOH there's no reason not to
> >    provide a C interface for this.
> 
> You mean an initial value for the rights register (PKRU), correct?
> 
> So init_val would be something like
> 
>   PKEY_DENY_ACCESS
>   PKEY_DENY_WRITE
> 
> and it would refer only to the key that was allocated.

Correct.

> >  - Along similar considerations, also add a sys_pkey_query() system call to query
> >    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
> >    at the moment.) This too could be vDSO accelerated in the future.
> 
> Do you mean whether the key is being used on a mapping (VMA) or rather
> whether the key is currently allocated (has been returned from
> sys_pkey_alloc() in the past)?

So in my mind 'pkeys' are an array of 16 values. The hardware allows us to map any
'protection key value' to any of the 16 indices.

The query interface would only query this array, i.e. it would tell us what 
current protection value a given pkey index has - if it's allocated. So 
sys_pkey_query(6) would return the current protection key value for index 6. If 
the index has not been allocated yet, it would return -EBADF or so.

This is what 'managed pkeys' means in essence.

Allocation/freeing of pkeys is a relatively rare operation, and pkeys get 
inherited across fork()/clone() (which further cuts down on management 
activities), but it looks simple in any case.
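
A rough user-space model of that bookkeeping, just to pin down the
semantics.  All names here (struct toy_pkeys, toy_pkey_alloc() and so
on) are hypothetical, and returning -ENOSPC when all 16 indices are
taken is my assumption; only the -EBADF case comes from the
description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PKEYS 16

struct toy_pkeys {
	uint16_t allocated;          /* bit n set: index n is allocated */
	unsigned long val[NR_PKEYS]; /* current protection value per index */
};

static int toy_in_use(const struct toy_pkeys *p, int pkey)
{
	return pkey >= 0 && pkey < NR_PKEYS && (p->allocated & (1u << pkey));
}

/* Returns the new index, or -ENOSPC (assumed) when all 16 are taken. */
static long toy_pkey_alloc(struct toy_pkeys *p, unsigned long init_val)
{
	assert(p != NULL);

	for (int i = 0; i < NR_PKEYS; i++) {
		if (!(p->allocated & (1u << i))) {
			p->allocated = (uint16_t)(p->allocated | (1u << i));
			p->val[i] = init_val;
			return i;
		}
	}
	return -ENOSPC;
}

/* Returns the current value, or -EBADF if the index is not allocated. */
static long toy_pkey_query(const struct toy_pkeys *p, int pkey)
{
	if (!toy_in_use(p, pkey))
		return -EBADF;
	return (long)p->val[pkey];
}

/* Releases an index; -EBADF if it was not allocated. */
static long toy_pkey_free(struct toy_pkeys *p, int pkey)
{
	if (!toy_in_use(p, pkey))
		return -EBADF;
	p->allocated = (uint16_t)(p->allocated & ~(1u << pkey));
	return 0;
}
```

In this model toy_pkey_query(&p, 6) on a fresh table gives -EBADF, and
gives the stored value once index 6 has been handed out by
toy_pkey_alloc().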

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-06 Thread Dave Hansen
On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>  - I'd also suggest providing an initial value with the 'alloc' call. It's true
>    that user-space can do this itself in assembly, OTOH there's no reason not to
>    provide a C interface for this.

You mean an initial value for the rights register (PKRU), correct?

So init_val would be something like

PKEY_DENY_ACCESS
PKEY_DENY_WRITE

and it would refer only to the key that was allocated.
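
As a sketch of "refer only to the key that was allocated": the helper
below replaces just that key's two bits in a PKRU value and leaves the
other fifteen keys alone.  The numeric values for PKEY_DENY_ACCESS and
PKEY_DENY_WRITE are assumptions (chosen to match the per-key bit
order), and pkru_with_init_val() is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encodings; only the flag names appear in this thread. */
#define PKEY_DENY_ACCESS 0x1u
#define PKEY_DENY_WRITE  0x2u

/* Return pkru with only pkey's two rights bits replaced by init_val. */
static uint32_t pkru_with_init_val(uint32_t pkru, int pkey, uint32_t init_val)
{
	assert(pkey >= 0 && pkey < 16);
	assert((init_val & ~(PKEY_DENY_ACCESS | PKEY_DENY_WRITE)) == 0);

	unsigned int shift = 2u * (unsigned int)pkey;

	pkru &= ~(0x3u << shift);   /* clear the old rights for this key */
	pkru |= init_val << shift;  /* install the requested ones */
	return pkru;
}
```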

>  - Along similar considerations, also add a sys_pkey_query() system call to query
>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>    at the moment.) This too could be vDSO accelerated in the future.

Do you mean whether the key is being used on a mapping (VMA) or rather
whether the key is currently allocated (has been returned from
sys_pkey_alloc() in the past)?

> I.e. something like:
> 
>  unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>  unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>  unsigned long sys_pkey_get   (int pkey)
>  unsigned long sys_pkey_free  (int pkey)
> 
> Optional suggestion:
> 
>  - _Maybe_ also allow the 'remote managed' setup of pkeys: of non-local tasks -
>    but I'm not sure about that: it looks expensive and complex, and a TID argument
>    can always be added later if there's some real need.

Yeah, let's see how the stuff above looks first.



Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-03 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Dave Hansen  wrote:
> 
> > > Another question, related to enumeration as well: I'm wondering whether
> > > there's any way for the kernel to allocate a bit or two for its own
> > > purposes - such as protecting crypto keys? Or is the facility
> > > fundamentally intended for user-space use only?
> > 
> > No, that's not possible with the current setup.
> 
> Ok, then another question, have you considered the following usecase:

So, I'm wondering about the following additional usecase:

Right now the native x86 PTE format allows two protection related bits for 
user-space pages:

  _PAGE_BIT_RW:   if 0 the page is read-only,  if 1 then it's read-write
  _PAGE_BIT_NX:   if 0 the page is executable, if 1 then it's not executable
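
As a rough illustration of why neither --x nor -w- can be expressed with
these two bits alone, the helper below decodes them for a present user page.
It assumes the usual x86-64 bit positions (RW at bit 1, NX at bit 63); the
macro names PTE_RW/PTE_NX and the function pte_prot_string() are made up for
this sketch.

```c
#include <stdint.h>

#define PTE_RW (1ULL << 1)   /* _PAGE_BIT_RW: writable when set       */
#define PTE_NX (1ULL << 63)  /* _PAGE_BIT_NX: not executable when set */

/*
 * Decode the effective user permissions of a present user-space PTE into
 * an "rwx"-style string. Read permission cannot be switched off with
 * these two bits, so the first character is always 'r'.
 */
static const char *pte_prot_string(uint64_t pte)
{
    int writable = (pte & PTE_RW) != 0;
    int executable = (pte & PTE_NX) == 0;

    if (writable)
        return executable ? "rwx" : "rw-";
    return executable ? "r-x" : "r--";
}
```

All four possible results start with 'r', which is the gap that a pkey with
access disabled would fill.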

As discussed previously, pkeys allows 'true execute only (--x)' mappings.

Another possibility would be 'true write-only (-w-)' mappings.

This too could in theory be introduced 'transparently', via 'pure PROT_WRITE'
mappings (i.e. no PROT_READ|PROT_EXEC bits set). Assuming the amount of
user-space with implicit 'PROT_WRITE implies PROT_READ' assumptions is not
unmanageable for a distro willing to try this.

Usage of this would be more limited than of pure PROT_EXEC mappings, but it's a 
nonzero set:

 - Write-only log buffers that are normally mmap()-ed from a file.

 - Write-only write() IO buffers that are only accessed via write().
   (kernel-space accesses ignore pkey values.)

   glibc's buffered IO might possibly make use of this, for write-only
   fopen()ed files.

 - Language runtimes could improve their security by eliminating W+X mappings
   of JIT-ed code, instead they could use two alias mappings: one alias is a
   true-exec (--x) mapping, the other (separately mapped, separately
   randomized) mapping is a true write-only (-w-) mapping for generated code.
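
The two-alias idea from the last item can be sketched with plain POSIX calls.
This only shows the aliasing itself: without pkeys the PROT_WRITE view stays
readable on x86, so it is not yet a true write-only mapping. The struct and
function names are hypothetical, and the protection of the second view is a
parameter so the sketch can be tried with PROT_READ where executable file
mappings are not permitted.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Two views of the same pages (hypothetical helper type). */
struct alias_pair {
    void  *wr_view;   /* the code generator stores through this view */
    void  *run_view;  /* the generated code would be run from here   */
    size_t len;
};

/*
 * Back both views with one unlinked temporary file and map it twice with
 * MAP_SHARED, so a store through wr_view shows up in run_view. Each mmap()
 * gets its own address. 'run_prot' is the protection of the second view;
 * a JIT would pass PROT_EXEC.
 * Returns 0 on success, -1 on failure.
 */
static int alias_pair_create(struct alias_pair *p, size_t len, int run_prot)
{
    FILE *f = tmpfile();
    int fd;

    if (!f)
        return -1;
    fd = fileno(f);
    if (ftruncate(fd, (off_t)len) != 0) {
        fclose(f);
        return -1;
    }

    p->wr_view  = mmap(NULL, len, PROT_WRITE, MAP_SHARED, fd, 0);
    p->run_view = mmap(NULL, len, run_prot, MAP_SHARED, fd, 0);
    fclose(f);  /* the mappings keep the pages alive */

    if (p->wr_view == MAP_FAILED || p->run_view == MAP_FAILED) {
        if (p->wr_view != MAP_FAILED)
            munmap(p->wr_view, len);
        if (p->run_view != MAP_FAILED)
            munmap(p->run_view, len);
        return -1;
    }
    p->len = len;
    return 0;
}
```

A runtime would emit code through wr_view and jump into run_view; with pkeys,
the write view could additionally be made unreadable.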

In addition to the security advantage, another advantage would be increased 
robustness: no accidental corruption of IO (or JIT) buffers via read-only 
codepaths.

Another advantage would be that it would utilize pkeys without having to teach 
applications to use new system calls.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-03 Thread Ingo Molnar

* Dave Hansen  wrote:

> On 10/01/2015 11:23 PM, Ingo Molnar wrote:
> >> > Also, how do we do mprotect_pkey and say "don't change the key"?
> >
> > So if we start managing keys as a resource (i.e. alloc/free up to 16 of
> > them), and provide APIs for user-space to do all that, then user-space is
> > not supposed to touch keys it has not allocated for itself - just like
> > it's not supposed to write to fds it has not opened.
> 
> I like that.  It gives us at least a "soft" indicator to userspace about what 
> keys it should or shouldn't be using.

Yes. A 16-bit allocation bitmap would solve this nicely.

> > Such an allocation method can still 'mess up', and if the kernel allocates
> > a key for its purposes it should not assume that user-space cannot change
> > it, but at least for non-buggy code there's no interaction and it would
> > work out fine.
> 
> Yeah.  It also provides a clean interface so that future hardware could
> enforce kernel "ownership" of a key which could protect against
> even buggy code.
> 
> So, we add a pair of syscalls,
> 
>   unsigned long sys_alloc_pkey(unsigned long flags??)
>   unsigned long sys_free_pkey(unsigned long pkey)
> 
> keep the metadata in the mm, and then make sure that userspace allocated
> it before it is allowed to do an mprotect_pkey() with it.

Yeah, so such an interface would allow the clean, transparent usage of pkeys
for pure PROT_EXEC mappings.

I'd expect the --x/PROT_EXEC mappings to be _by far_ more frequently used than
pure pkeys - but we still need the management interface to keep the kernel's
use of pkeys separate from user-space's use.

If all the necessary tooling changes are propagated through then in fact I'd
expect every pkeys capable Linux system to use pkeys, for almost every
user-space task.

To have maximum future flexibility for pkeys I'd suggest the following
additional changes to the syscall ABI:

 - Please name them with a pkey_ prefix, along the sys_pkey_* nomenclature, so 
   that it becomes an easily identified 'family' of system calls.

 - I'd also suggest providing an initial value with the 'alloc' call. It's true
   that user-space can do this itself in assembly, OTOH there's no reason not
   to provide a C interface for this.

 - Make the pkey identifier 'int', not 'long', like fds are. There's very little
   expectation to ever have more than 4 billion pkeys per mm, right?

 - How far do we want the kernel to manage this? Any reason we don't want a
   'set pkey' operation, if user-space wants to use pure C interfaces? That
   could be vDSO accelerated as well, to use the unprivileged op. An advantage
   of such an interface would be that it would enable the kernel to more
   actively manage the actual mappings as well in the future: for example to
   automatically not allow accidental RWX mappings. Such an interface would
   also allow the future introduction of privileged pkey mappings on the
   hardware side, without having to change user-space, since everything goes
   via the kernel interface.

 - Along similar considerations, also add a sys_pkey_query() system call to
   query the mapping of a specific pkey. (returns -EBADF or so if the key is
   not mapped at the moment.) This too could be vDSO accelerated in the
   future.

I.e. something like:

 unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
 unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
 unsigned long sys_pkey_get   (int pkey)
 unsigned long sys_pkey_free  (int pkey)
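
Purely as an illustration of the intended semantics, here is a small
user-space model of that family. The struct pkey_table type, the model_*
names, the rejection of nonzero flags and the -ENOSPC return when all 16 keys
are taken are assumptions of this sketch; only the four operations and the
-EBADF idea come from the proposal above.

```c
#include <errno.h>

#define NR_PKEYS 16

/* Hypothetical per-mm state: which keys are allocated, and their values. */
struct pkey_table {
    unsigned short allocated;      /* one bit per pkey */
    unsigned long  val[NR_PKEYS];  /* current protection value per pkey */
};

static int pkey_is_allocated(const struct pkey_table *t, int pkey)
{
    return pkey >= 0 && pkey < NR_PKEYS && (t->allocated & (1u << pkey));
}

/* Allocate the lowest free key and give it init_val; -ENOSPC if none left. */
static long model_pkey_alloc(struct pkey_table *t, unsigned long flags,
                             unsigned long init_val)
{
    int pkey;

    if (flags)
        return -EINVAL;  /* no flags are defined in this sketch */
    for (pkey = 0; pkey < NR_PKEYS; pkey++) {
        if (!(t->allocated & (1u << pkey))) {
            t->allocated |= (unsigned short)(1u << pkey);
            t->val[pkey] = init_val;
            return pkey;
        }
    }
    return -ENOSPC;
}

/* Change the protection value of an allocated key. */
static long model_pkey_set(struct pkey_table *t, int pkey,
                           unsigned long new_val)
{
    if (!pkey_is_allocated(t, pkey))
        return -EBADF;
    t->val[pkey] = new_val;
    return 0;
}

/* Query the protection value; -EBADF if the key is not allocated. */
static long model_pkey_get(const struct pkey_table *t, int pkey)
{
    if (!pkey_is_allocated(t, pkey))
        return -EBADF;
    return (long)t->val[pkey];
}

/* Return a key to the pool. */
static long model_pkey_free(struct pkey_table *t, int pkey)
{
    if (!pkey_is_allocated(t, pkey))
        return -EBADF;
    t->allocated &= (unsigned short)~(1u << pkey);
    return 0;
}
```

The point of the model is that every operation goes through the allocation
state first, which is what makes the keys 'managed'.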

Optional suggestion:

 - _Maybe_ also allow the 'remote managed' setup of pkeys: of non-local tasks -
   but I'm not sure about that: it looks expensive and complex, and a TID
   argument can always be added later if there's some real need.

> That should be pretty easy to implement.  The only real overhead is the 16
> bits we need to keep in the mm somewhere.

Yes.

Note that if we use the C syscall interface suggestions I outlined above, we
could in the future also change to have a full table, and manage it
explicitly - without user-space changes - if the hardware side is tweaked to
allow kernel side pkeys.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-03 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > It's quite likely that you will find that compilers put read-only constants
> > in the text section, knowing that executable means readable.
> 
> At least with pkeys enabling true --x mappings, that compiler practice
> becomes a (mild) security problem: it provides a readable and executable
> return target for stack/buffer overflow attacks - FWIW. (It's a limited
> concern because the true code areas are executable already.)

Btw., it's not just security, there will also be a robustness advantage to
creating true PROT_EXEC mappings: right now if buggy user-space code
accidentally references into an executable section (say it uses a negative
index in a table put into .rodata), the code will not crash, it will happily
read from the .text area.

But if we mapped .text with true PROT_EXEC (and the CPU enforced that) then
we'd get a nice segfault.

This has additional security benefits as well, beyond not providing readable
ROP sites - which in fact look more significant than the ROP readability angle
I mentioned initially.

So to sum it up, if we use true --x (non-readable PROT_EXEC) mappings using
pkeys, we get the following benefits:

 - Overflows and other out of bounds accesses from .rodata (and other data
   sections near .text) will be caught by the CPU instead of silent data flow 
   corruption. This has robustness (and thus security) advantages.

 - True --x code is not readable, thus not 'soft-discoverable' via information 
   leaks for ROP purposes.

 - The version fingerprinting of unknown remote target binaries via information 
   leaks becomes harder as well.

 - The local (and remote) guessing of ASLR offsets via information leaks gets
   harder as well.

 - We get to test pkeys much more seriously than the opt-in special uses! :-)

Intel sent me pkeys test hardware, so I can give it a go in practice as well
and see how well it works.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-03 Thread Ingo Molnar

* Paolo Bonzini  wrote:

> 
> 
> On 02/10/2015 13:58, Linus Torvalds wrote:
> > On Fri, Oct 2, 2015 at 7:49 AM, Paolo Bonzini  wrote:
> >> On 02/10/2015 00:48, Linus Torvalds wrote:
> >>> It's quite likely that you will find that compilers put read-only
> >>> constants in the text section, knowing that executable means readable.
> >>
> >> Not on x86 (because it has large immediates; RISC machines and s390 do
> >> put large constants in the text section).
> >>
> >> But at the very least jump tables reside in the .text section.
> > 
> > Yes, at least traditionally gcc put things like the jump tables for
> > switch() statements immediately next to the code. That caused lots of
> > pain on the P4, where the L1 I$ and D$ were exclusive. I think that
> > caused gcc to then put the jump tables further away, and it might be
> > in a separate section these days - but it might also just be
> > "sufficiently aligned" that the L1 cache issue isn't in play any more.
> > 
> > Anyway, because of the P4 exclusive L1 I/D$ issue we can pretty much
> > rest easy knowing that the data accesses and text accesses should be
> > separated by at least one cacheline (maybe even 128 bytes - I think
> > the P4 used 64-byte line size, but it was sub-sections of a 128-byte
> > bigger line - but that might have been in the L2 only).
> > 
> > But I could easily see the compiler/linker still putting them in the
> > same ELF segment.
> 
> You're entirely right, it puts them in .rodata actually.  But .rodata is
> in the same segment as .text:
> 
> $ readelf --segments /bin/true
> ...
>  Section to Segment mapping:
>   Segment Sections...
>    00
>    01 .interp
>    02 .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym
>       .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init
>       .plt .text .fini .rodata .eh_frame_hdr .eh_frame
>    03 .init_array .fini_array .jcr .data.rel.ro .dynamic .got .data .bss
>    04 .dynamic
>    05 .note.ABI-tag .note.gnu.build-id
>    06 .eh_frame_hdr
>    07
>    08 .init_array .fini_array .jcr .data.rel.ro .dynamic .got

Is there an easy(-ish) way (i.e. using compiler/linker flags, not linker
scripts) to build the ELF binary in such a way that non-code data:

  .rodata .eh_frame_hdr .eh_frame 

... gets put into a separate (readonly and non-executable) segment? That would 
enable things from the distro side AFAICS, right?

(assuming I'm reading the ELF dump right.)

Or does this need binutils surgery?

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Dave Hansen
On 10/01/2015 06:38 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2015 at 6:56 PM, Dave Hansen  wrote:
>>
>> Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
>> also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
>> but aren't _really_ going to read it, so we can safely deny them all
>> access other than exec.
> 
> That's a completely insane assumption. There are tons of reasons to
> have code and read-only data in the same segment, and it's very
> traditional. Just assuming that you only execute out of something that
> has PROT_EXEC | PROT_READ is insane.

Yes, it's insane, and I confirmed that ld.so actually reads some stuff
out of the first page of the r-x part of the executable.

But, it did find a bug in my code where I wouldn't allow instruction
fetches to fault in pages in a pkey-protected area, so it wasn't a
completely worthless exercise.


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Dave Hansen
On 10/01/2015 11:23 PM, Ingo Molnar wrote:
>> > Also, how do we do mprotect_pkey and say "don't change the key"?
> So if we start managing keys as a resource (i.e. alloc/free up to 16 of
> them), and provide APIs for user-space to do all that, then user-space is
> not supposed to touch keys it has not allocated for itself - just like it's
> not supposed to write to fds it has not opened.

I like that.  It gives us at least a "soft" indicator to userspace about
what keys it should or shouldn't be using.

> Such an allocation method can still 'mess up', and if the kernel allocates a
> key for its purposes it should not assume that user-space cannot change it,
> but at least for non-buggy code there's no interaction and it would work out
> fine.

Yeah.  It also provides a clean interface so that future hardware could
enforce kernel "ownership" of a key which could protect against
even buggy code.

So, we add a pair of syscalls,

unsigned long sys_alloc_pkey(unsigned long flags??)
unsigned long sys_free_pkey(unsigned long pkey)

keep the metadata in the mm, and then make sure that userspace allocated
it before it is allowed to do an mprotect_pkey() with it.

mprotect_pkey(addr, flags, pkey)
{
	/* refuse keys that user-space never allocated */
	if (!(mm->pkeys_allocated & (1 << pkey)))
		return -EINVAL;
}

That should be pretty easy to implement.  The only real overhead is the
16 bits we need to keep in the mm somewhere.
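
A compilable version of that check, as a sketch only: struct mm_model stands
in for whatever part of the mm would hold the 16 bits, and the function name
mprotect_pkey_check() is hypothetical. The explicit range test is an addition
of this sketch; it keeps the shift well defined for out-of-range key numbers.

```c
#include <errno.h>

#define NR_PKEYS 16

/* Stand-in for the part of mm_struct that would hold the bitmap. */
struct mm_model {
    unsigned short pkeys_allocated;  /* bit N set: pkey N was handed out */
};

/*
 * Gate for a hypothetical mprotect_pkey(): reject keys that are out of
 * range (which also keeps the shift below well defined) or that user-space
 * never allocated. Returns 0 if the key may be used, -EINVAL otherwise.
 */
static int mprotect_pkey_check(const struct mm_model *mm, int pkey)
{
    if (pkey < 0 || pkey >= NR_PKEYS)
        return -EINVAL;
    if (!(mm->pkeys_allocated & (1u << pkey)))
        return -EINVAL;
    return 0;
}
```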


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Paolo Bonzini


On 02/10/2015 13:58, Linus Torvalds wrote:
> On Fri, Oct 2, 2015 at 7:49 AM, Paolo Bonzini  wrote:
>> On 02/10/2015 00:48, Linus Torvalds wrote:
>>> It's quite likely that you will find that compilers put read-only
>>> constants in the text section, knowing that executable means readable.
>>
>> Not on x86 (because it has large immediates; RISC machines and s390 do
>> put large constants in the text section).
>>
>> But at the very least jump tables reside in the .text section.
> 
> Yes, at least traditionally gcc put things like the jump tables for
> switch() statements immediately next to the code. That caused lots of
> pain on the P4, where the L1 I$ and D$ were exclusive. I think that
> caused gcc to then put the jump tables further away, and it might be
> in a separate section these days - but it might also just be
> "sufficiently aligned" that the L1 cache issue isn't in play any more.
> 
> Anyway, because of the P4 exclusive L1 I/D$ issue we can pretty much
> rest easy knowing that the data accesses and text accesses should be
> separated by at least one cacheline (maybe even 128 bytes - I think
> the P4 used 64-byte line size, but it was sub-sections of a 128-byte
> bigger line - but that might have been in the L2 only).
> 
> But I could easily see the compiler/linker still putting them in the
> same ELF segment.

You're entirely right, it puts them in .rodata actually.  But .rodata is
in the same segment as .text:

$ readelf --segments /bin/true
...
 Section to Segment mapping:
  Segment Sections...
   00 
   01 .interp 
   02 .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym
      .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init
      .plt .text .fini .rodata .eh_frame_hdr .eh_frame
   03 .init_array .fini_array .jcr .data.rel.ro .dynamic .got .data .bss 
   04 .dynamic 
   05 .note.ABI-tag .note.gnu.build-id 
   06 .eh_frame_hdr 
   07 
   08 .init_array .fini_array .jcr .data.rel.ro .dynamic .got 


Paolo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Linus Torvalds
On Fri, Oct 2, 2015 at 7:49 AM, Paolo Bonzini  wrote:
> On 02/10/2015 00:48, Linus Torvalds wrote:
>> It's quite likely that you will find that compilers put read-only
>> constants in the text section, knowing that executable means readable.
>
> Not on x86 (because it has large immediates; RISC machines and s390 do
> put large constants in the text section).
>
> But at the very least jump tables reside in the .text section.

Yes, at least traditionally gcc put things like the jump tables for
switch() statements immediately next to the code. That caused lots of
pain on the P4, where the L1 I$ and D$ were exclusive. I think that
caused gcc to then put the jump tables further away, and it might be
in a separate section these days - but it might also just be
"sufficiently aligned" that the L1 cache issue isn't in play any more.

Anyway, because of the P4 exclusive L1 I/D$ issue we can pretty much
rest easy knowing that the data accesses and text accesses should be
separated by at least one cacheline (maybe even 128 bytes - I think
the P4 used 64-byte line size, but it was sub-sections of a 128-byte
bigger line - but that might have been in the L2 only).

But I could easily see the compiler/linker still putting them in the
same ELF segment.

  Linus


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Paolo Bonzini


On 02/10/2015 00:48, Linus Torvalds wrote:
> It's quite likely that you will find that compilers put read-only
> constants in the text section, knowing that executable means readable.

Not on x86 (because it has large immediates; RISC machines and s390 do
put large constants in the text section).

But at the very least jump tables reside in the .text section.

Paolo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Thu, Oct 1, 2015 at 6:33 PM, Dave Hansen  wrote:
> >
> > Here it is in a quite fugly form (well, it's not opt-in).  Init crashes if I
> > boot with this, though.
> >
> > I'll see if I can turn it in to a bit more of an opt-in and see what's
> > actually going wrong.
> 
> It's quite likely that you will find that compilers put read-only constants in
> the text section, knowing that executable means readable.

At least with pkeys enabling true --x mappings, that compiler practice becomes a
(mild) security problem: it provides a readable and executable return target for
stack/buffer overflow attacks - FWIW. (It's a limited concern because the true
code areas are executable already.)

I'd expect such readonly data to eventually move out into the regular data
sections, the moment the kernel gives a tool to distros to enforce true PROT_EXEC
mappings.

> So it's entirely possible that it's pretty much all over.

I'd expect that too.

> That said, I don't understand your patch. Why check PROT_WRITE? We've had
> "execute but not write" forever. It's "execute and not *read*" that is
> interesting.

Yeah, but almost none of user-space seems to be using it.

> So I wonder if your testing is just bogus. But maybe I'm mis-reading this?

I don't think you are mis-reading it: my (hacky! bad! not signed off!) debug idea
was to fudge PROT_EXEC|PROT_READ bits into pure PROT_EXEC only - at least to get
pkeys used in a much more serious fashion than standalone testcases, without
having to change the distro itself.

You are probably right that true data reads from executable sections are very
common, so this might not be a viable technique even for testing purposes.

But worth a try.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> >> Assuming it boots up fine on a typical distro, i.e. assuming that there are no
> >> surprises where PROT_READ && PROT_EXEC sections are accessed as data.
> >
> > I can't wait to find out what implicitly expects PROT_READ from
> > PROT_EXEC mappings. :)

So what seems to happen is that there are no pure PROT_EXEC mappings in practice -
they are only omnibus PROT_READ|PROT_EXEC mappings, an unknown proportion of which
truly relies on PROT_READ:

  $ for C in firefox ls perf libreoffice google-chrome Xorg xterm \
  konsole; do echo; echo "# $C:"; strace -e trace=mmap -f $C -h 2>&1 | \
  cut -d, -f3 | grep PROT | sort | uniq -c; done

# firefox:
 13  PROT_READ
 82  PROT_READ|PROT_EXEC
184  PROT_READ|PROT_WRITE
  2  PROT_READ|PROT_WRITE|PROT_EXEC

# ls:
  2  PROT_READ
  7  PROT_READ|PROT_EXEC
 17  PROT_READ|PROT_WRITE

# perf:
  1  PROT_READ
 20  PROT_READ|PROT_EXEC
 44  PROT_READ|PROT_WRITE

# libreoffice:
  2  PROT_NONE
 87  PROT_READ
148  PROT_READ|PROT_EXEC
339  PROT_READ|PROT_WRITE

# google-chrome:
 39  PROT_READ
121  PROT_READ|PROT_EXEC
345  PROT_READ|PROT_WRITE

# Xorg:
  1  PROT_READ
 22  PROT_READ|PROT_EXEC
 39  PROT_READ|PROT_WRITE

# xterm:
  1  PROT_READ
 25  PROT_READ|PROT_EXEC
 46  PROT_READ|PROT_WRITE

# konsole:
  1  PROT_READ
101  PROT_READ|PROT_EXEC
175  PROT_READ|PROT_WRITE

So whatever kernel side method we come up with, it's not something that I expect
to become production quality. "Proper" conversion to pkeys has to be driven from
the user-space side.

That does not mean we can not try! :-)
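
The strace pipeline above looks at mmap() requests as they are made. A complementary check is to look at the mappings a process ends up with. What follows is a rough sketch of that idea, not what was run for the numbers above: it assumes the usual /proc/<pid>/maps line layout ("start-end perms offset ..."), and the helper names are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Convert a maps permission field such as "r-xp" into PROT_* bits. */
int perms_to_prot(const char *perms)
{
    int prot = PROT_NONE;
    size_t n;

    assert(perms != NULL);
    n = strlen(perms);
    if (n > 0 && perms[0] == 'r')
        prot |= PROT_READ;
    if (n > 1 && perms[1] == 'w')
        prot |= PROT_WRITE;
    if (n > 2 && perms[2] == 'x')
        prot |= PROT_EXEC;
    return prot;
}

/* True only for --x: executable, neither readable nor writable. */
int is_exec_only(int prot)
{
    return (prot & (PROT_READ | PROT_WRITE | PROT_EXEC)) == PROT_EXEC;
}

/*
 * Count executable mappings in a maps-format stream and, separately,
 * those that are pure PROT_EXEC. Returns the number of lines parsed.
 */
int count_exec_mappings(FILE *maps, int *exec_total, int *exec_only)
{
    char line[512], perms[5];
    int parsed = 0;

    assert(maps && exec_total && exec_only);
    *exec_total = *exec_only = 0;
    while (fgets(line, sizeof(line), maps)) {
        /* Skip the address range, then read the 4-char perms field. */
        if (sscanf(line, "%*s %4s", perms) == 1) {
            int prot = perms_to_prot(perms);

            parsed++;
            if (prot & PROT_EXEC) {
                (*exec_total)++;
                *exec_only += is_exec_only(prot);
            }
        }
        /* Discard the tail of any line longer than the buffer. */
        if (!strchr(line, '\n')) {
            int c;

            while ((c = fgetc(maps)) != EOF && c != '\n')
                ;
        }
    }
    return parsed;
}
```

Feeding it fopen("/proc/self/maps", "r") gives the two counters for the calling process; going by the strace results above, one would expect the exec-only counter to stay at zero on a typical distro.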

> There's one annoying issue at least:
> 
> mprotect_pkey(..., PROT_READ | PROT_EXEC, 0) sets protection key 0.
> mprotect_pkey(..., PROT_EXEC, 0) maybe sets protection key 15 or
> whatever we use for this.  What does mprotect_pkey(..., PROT_EXEC, 0)
> do?  What if the caller actually wants key 0?  What if some CPU vendor
> some day implements --x for real?

That comes from the hardcoded "user-space has 4 bits to itself, not managed by the
kernel" assumption in the whole design. So no layering between different
user-space libraries using pkeys in a different fashion, no transparent kernel use
of pkeys (such as it may be), etc.

I'm not sure it's _worth_ managing these 4 bits, but '16 separate keys' does seem
to me to be above a certain resource threshold that should be more explicitly
managed than telling user-space: "it's all yours!".

> Also, how do we do mprotect_pkey and say "don't change the key"?

So if we start managing keys as a resource (i.e. alloc/free up to 16 of them), and
provide APIs for user-space to do all that, then user-space is not supposed to
touch keys it has not allocated for itself - just like it's not supposed to write
to fds it has not opened.

Such an allocation method can still 'mess up', and if the kernel allocates a key
for its purposes it should not assume that user-space cannot change it, but at
least for non-buggy code there's no interaction and it would work out fine.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Ingo Molnar

* Dave Hansen  wrote:

> On 10/01/2015 01:39 PM, Kees Cook wrote:
> > On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
> >> So could we try to add an (opt-in) kernel option that enables this
> >> transparently and automatically for all PROT_EXEC && !PROT_WRITE mappings,
> >> without any user-space changes and syscalls necessary?
> > 
> > I would like this very much. :)
> 
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.
> 
> I'll see if I can turn it in to a bit more of an opt-in and see what's
> actually going wrong.

So the reality of modern Linux distros is that, according to some limited
strace-ing around, pure PROT_EXEC usage does not seem to exist: 99% of executable
mappings are mapped via PROT_EXEC|PROT_READ.

So the most usable kernel testing approach would be to enable these types of pkeys
for a child task via some mechanism and inherit it to all children (including
inheriting it over non-suid exec) - but not to any other task.

You could hijack a new personality bit just for debug purposes - see the (totally
untested) patch below.

Depending on user-space's assumptions it might not end up being anything usable we
can apply, but it would be a great testing tool if it worked to a certain degree.

I.e. allow the system to boot in without pkeys set for any task, then set the
personality of a shell process to PER_LINUX_PKEYS and see which binaries (if any!)
will start up without segfaulting.
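
For what the user-space side of such an opt-in might look like, here is a minimal sketch built on the existing personality(2) call. The flag value is an assumption made up for this example (it mirrors the debug bit proposed here and means nothing to an unpatched kernel), and the helper names are hypothetical too:

```c
#include <assert.h>
#include <sys/personality.h>
#include <unistd.h>

/*
 * ASSUMPTION: hypothetical flag bit, for illustration only. It mirrors
 * the PROT_READ_EXEC_HACK idea and is not a mainline personality flag.
 */
#define PROT_READ_EXEC_HACK_BIT 0x0010000UL

/* Compose the new persona: keep every existing bit, add the debug bit. */
unsigned long persona_with_hack(unsigned long current)
{
    return current | PROT_READ_EXEC_HACK_BIT;
}

/*
 * Opt the calling process in, then exec the command so that it and its
 * children inherit the persona. Returns -1 on failure and does not
 * return on success.
 */
int run_with_hack(char *const argv[])
{
    /* personality(0xffffffff) queries the persona without changing it. */
    int current = personality(0xffffffffUL);

    assert(argv != NULL && argv[0] != NULL);
    if (current == -1)
        return -1;
    if (personality(persona_with_hack((unsigned long)current)) == -1)
        return -1;
    execvp(argv[0], argv);
    return -1; /* only reached if execvp() failed */
}
```

A small wrapper whose main() passes argv + 1 to run_with_hack() would then act like setarch does for the existing flags: start a shell under it and everything launched from that shell carries the bit.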

This way you don't have to debug SystemD, which is extremely fragile and 
passive-aggressive towards kernels that don't behave in precisely the fashion 
under which SystemD is being developed.

Thanks,

Ingo

>

Absolutely-Not-Signed-off-by: Ingo Molnar 

 include/uapi/linux/personality.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index aa169c4339d2..bead47213419 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -8,6 +8,7 @@
  * These occupy the top three bytes.
  */
 enum {
+       PROT_READ_EXEC_HACK =   0x0010000,      /* PROT_READ|PROT_EXEC == PROT_EXEC hack */
        UNAME26 =               0x0020000,
        ADDR_NO_RANDOMIZE =     0x0040000,      /* disable randomization of VA space */
        FDPIC_FUNCPTRS =        0x0080000,      /* userspace function ptrs point to descriptors
@@ -41,6 +42,7 @@ enum {
 enum {
        PER_LINUX =             0x0000,
        PER_LINUX_32BIT =       0x0000 | ADDR_LIMIT_32BIT,
+       PER_LINUX_PKEYS =       0x0000 | PROT_READ_EXEC_HACK,
        PER_LINUX_FDPIC =       0x0000 | FDPIC_FUNCPTRS,
        PER_SVR4 =              0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
        PER_SVR3 =              0x0002 | STICKY_TIMEOUTS | SHORT_INODE,


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Dave Hansen
On 10/01/2015 06:38 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2015 at 6:56 PM, Dave Hansen  wrote:
>>
>> Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
>> also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
>> but aren't _really_ going to read it, so we can safely deny them all
>> access other than exec.
> 
> That's a completely insane assumption. There are tons of reasons to
> have code and read-only data in the same segment, and it's very
> traditional. Just assuming that you only execute out of something that
> has PROT_EXEC | PROT_READ is insane.

Yes, it's insane, and I confirmed that ld.so actually reads some stuff
out of the first page of the r-x part of the executable.

But, it did find a bug in my code where I wouldn't allow instruction
fetches to fault in pages in a pkey-protected area, so it wasn't a
completely worthless exercise.


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-02 Thread Dave Hansen
On 10/01/2015 11:23 PM, Ingo Molnar wrote:
>> > Also, how do we do mprotect_pkey and say "don't change the key"?
> So if we start managing keys as a resource (i.e. alloc/free up to 16 of them),
> and provide APIs for user-space to do all that, then user-space is not supposed
> to touch keys it has not allocated for itself - just like it's not supposed to
> write to fds it has not opened.

I like that.  It gives us at least a "soft" indicator to userspace about
what keys it should or shouldn't be using.

> Such an allocation method can still 'mess up', and if the kernel allocates a
> key for its purposes it should not assume that user-space cannot change it, but
> at least for non-buggy code there's no interaction and it would work out fine.

Yeah.  It also provides a clean interface so that future hardware could
enforce kernel "ownership" of a key which could protect against
even buggy code.

So, we add a pair of syscalls,

unsigned long sys_alloc_pkey(unsigned long flags??)
unsigned long sys_free_pkey(unsigned long pkey)

keep the metadata in the mm, and then make sure that userspace allocated
it before it is allowed to do an mprotect_pkey() with it.

mprotect_pkey(addr, flags, pkey)
{
	/* refuse any key that userspace has not allocated */
	if (!(mm->pkeys_allocated & (1 << pkey)))
		return -EINVAL;
}

That should be pretty easy to implement.  The only real overhead is the
16 bits we need to keep in the mm somewhere.
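
To make the bookkeeping concrete, here is a small user-space model of it. It is only a sketch under assumptions: struct pkey_map stands in for whatever field the mm would carry, the pkey_map_*() names are made up, and it hands out key 0 like any other key, which sidesteps the open question of whether key 0 should be reserved as the default.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PKEYS 16 /* the 16 keys discussed in this thread */

/* Hypothetical stand-in for the per-mm metadata: one bit per key. */
struct pkey_map {
    uint16_t allocated;
};

/* Hand out the lowest free key, or -ENOSPC when all 16 are taken. */
int pkey_map_alloc(struct pkey_map *m)
{
    assert(m != NULL);
    for (int pkey = 0; pkey < NR_PKEYS; pkey++) {
        if (!(m->allocated & (1u << pkey))) {
            m->allocated |= (uint16_t)(1u << pkey);
            return pkey;
        }
    }
    return -ENOSPC;
}

/* The check mprotect_pkey() would make before accepting a key. */
int pkey_map_check(const struct pkey_map *m, int pkey)
{
    assert(m != NULL);
    if (pkey < 0 || pkey >= NR_PKEYS || !(m->allocated & (1u << pkey)))
        return -EINVAL;
    return 0;
}

/* Release a key; freeing a key that was never allocated is an error. */
int pkey_map_free(struct pkey_map *m, int pkey)
{
    int ret = pkey_map_check(m, pkey);

    if (ret)
        return ret;
    m->allocated &= (uint16_t)~(1u << pkey);
    return 0;
}
```

A real implementation would also need locking around the bitmap and a decision on what happens to pages still tagged with a key when it is freed; neither is modelled here.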


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Linus Torvalds
On Thu, Oct 1, 2015 at 6:56 PM, Dave Hansen  wrote:
>
> Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
> also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
> but aren't _really_ going to read it, so we can safely deny them all
> access other than exec.

That's a completely insane assumption. There are tons of reasons to
have code and read-only data in the same segment, and it's very
traditional. Just assuming that you only execute out of something that
has PROT_EXEC | PROT_READ is insane.

No, what you *should* look at is to use the protection keys to
actually enforce a plain PROT_EXEC. That has never worked before
(because traditionally R implies X, and then we got NX).

That would at least allow people who know they don't intersperse
read-only constants in the code to use PROT_EXEC only.

Of course, there may well be users who use PROT_EXEC that actually *do*
do reads, and just relied on the old hardware behavior. So it's not
guaranteed to work either without any extra flags. But at least it's
worth a try, unlike the "yeah, the user asked for read, but the user
doesn't know what he's doing" thinking that is just crazy talk.
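
For reference, asking for plain PROT_EXEC from user space looks like the sketch below. The two helper names are made up for this example. Whether a later data read from such a region actually faults depends on the hardware and the kernel, which is exactly the point under discussion, so the sketch only makes the request and never reads the memory.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask for an anonymous mapping that is executable but neither readable
 * nor writable. Returns NULL if the kernel refuses the request.
 */
void *map_exec_only(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Drop read access from an existing mapping, keeping only execute. */
int make_exec_only(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_EXEC);
}
```

A loader that knows its text contains no interspersed constants could call make_exec_only() on the text mapping once relocation is done; code that still reads from the region would then be the first to find out whether the protection is enforced.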

   Linus


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Andy Lutomirski
On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen  wrote:
> On 10/01/2015 01:39 PM, Kees Cook wrote:
>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>>> So could we try to add an (opt-in) kernel option that enables this
>>> transparently and automatically for all PROT_EXEC && !PROT_WRITE mappings,
>>> without any user-space changes and syscalls necessary?
>>
>> I would like this very much. :)
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.

Somebody really ought to rework things so that a crash in init prints
out a normal indication of the unhandled signal and optionally leaves
everything else running.

Also...

EPT seems to have separate R, W, and X flags.  I wonder if it would
make sense to add a KVM paravirt feature that maps the entire guest
physical space an extra time at a monstrous offset with R cleared in
the EPT and passes through a #PF or other notification (KVM-specific
thing? #VE?) on a read fault.

This wouldn't even need a whole duplicate paging hierarchy -- it would
just duplicate the EPT PML4 entries, so it would add exactly zero
runtime memory usage.

The guest would use it by treating the high bit of the physical
address as a "may read" bit.

This reminds me -- we should probably wire up X86_TRAP_VE with a stub
that OOPSes until someone figures out some more useful thing to do.
We're probably not doing anyone any favors by unconditionally
promoting them to double-faults.

--Andy


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Dave Hansen
On 10/01/2015 03:48 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2015 at 6:33 PM, Dave Hansen  wrote:
>>
>> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
>> if I boot with this, though.
>>
>> I'll see if I can turn it in to a bit more of an opt-in and see what's
>> actually going wrong.
...
> That said, I don't understand your patch. Why check PROT_WRITE? We've
> had "execute but not write" forever. It's "execute and not *read*"
> that is interesting.

I was thinking that almost anybody doing a PROT_WRITE|PROT_EXEC really
*is* going to write to it so they'll notice pretty fast if we completely
deny them access to it.

Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
but aren't _really_ going to read it, so we can safely deny them all
access other than exec.


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Linus Torvalds
On Thu, Oct 1, 2015 at 6:33 PM, Dave Hansen  wrote:
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.
>
> I'll see if I can turn it in to a bit more of an opt-in and see what's
> actually going wrong.

It's quite likely that you will find that compilers put read-only
constants in the text section, knowing that executable means readable.

So it's entirely possible that it's pretty much all over.

That said, I don't understand your patch. Why check PROT_WRITE? We've
had "execute but not write" forever. It's "execute and not *read*"
that is interesting.

So I wonder if your testing is just bogus. But maybe I'm mis-reading this?

Linus


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Dave Hansen
On 10/01/2015 03:35 PM, Kees Cook wrote:
> On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen  wrote:
>> On 10/01/2015 01:39 PM, Kees Cook wrote:
>>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>>>> So could we try to add an (opt-in) kernel option that enables this
>>>> transparently and automatically for all PROT_EXEC && !PROT_WRITE mappings,
>>>> without any user-space changes and syscalls necessary?
>>>
>>> I would like this very much. :)
>>
>> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
>> if I boot with this, though.
> 
> *laugh* Okay... well, we've got some work to do, I guess. :)
> 
> (And which init?)

systemd for better or worse.



Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Kees Cook
On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen  wrote:
> On 10/01/2015 01:39 PM, Kees Cook wrote:
>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>>> So could we try to add an (opt-in) kernel option that enables this transparently
>>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>>> user-space changes and syscalls necessary?
>>
>> I would like this very much. :)
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.

*laugh* Okay... well, we've got some work to do, I guess. :)

(And which init?)

> I'll see if I can turn it in to a bit more of an opt-in and see what's
> actually going wrong.

Cool, thanks!

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Dave Hansen
On 10/01/2015 01:39 PM, Kees Cook wrote:
> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>> So could we try to add an (opt-in) kernel option that enables this transparently
>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>> user-space changes and syscalls necessary?
> 
> I would like this very much. :)

Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
if I boot with this, though.

I'll see if I can turn it in to a bit more of an opt-in and see what's
actually going wrong.




---

 b/arch/x86/include/asm/fpu/internal.h |    4
 b/arch/x86/kernel/fpu/core.c          |    4
 b/arch/x86/kernel/fpu/xstate.c        |   16 +++-
 b/arch/x86/mm/fault.c                 |    8 ++--
 b/include/linux/mm_types.h            |    1 +
 b/kernel/fork.c                       |    3 ++-
 b/kernel/sched/core.c                 |    3 +++
 b/mm/mmap.c                           |    8 +++-
 b/mm/mprotect.c                       |   27 ++-
 9 files changed, 68 insertions(+), 6 deletions(-)

diff -puN mm/mprotect.c~pkeys-95-rewire-mprotect-to-use-pkeys mm/mprotect.c
--- a/mm/mprotect.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.183874598 -0700
+++ b/mm/mprotect.c	2015-10-01 15:28:14.741262888 -0700
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -453,10 +454,34 @@ out:
 	return error;
 }
 
+u32 __read_mostly mprotect_hack_pkey = 1;
+int mprotect_hack_pkey_init(void)
+{
+   debugfs_create_u32("mprotect_hack_pkey",  S_IRUSR | S_IWUSR,
+   NULL, &mprotect_hack_pkey);
+   return 0;
+}
+late_initcall(mprotect_hack_pkey_init);
+
+int pkey_for_access_protect = 1;
+int pkey_for_write_protect = 2;
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
-	return do_mprotect_key(start, len, prot, 0);
+	int ret;
+	unsigned long newprot = prot;
+	u32 pkey_hack = READ_ONCE(mprotect_hack_pkey);
+	u16 pkey = 0;
+
+	if (!pkey_hack)
+		return do_mprotect_key(start, len, prot, 0);
+
+	if ((prot & PROT_EXEC) && !(prot & PROT_WRITE))
+		pkey = pkey_for_access_protect;
+
+	ret = do_mprotect_key(start, len, newprot, pkey);
+
+	return ret;
 }
 
 SYSCALL_DEFINE4(mprotect_key, unsigned long, start, size_t, len,
diff -puN include/linux/mm_types.h~pkeys-95-rewire-mprotect-to-use-pkeys include/linux/mm_types.h
--- a/include/linux/mm_types.h~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.185874687 -0700
+++ b/include/linux/mm_types.h	2015-10-01 15:21:25.227876573 -0700
@@ -486,6 +486,7 @@ struct mm_struct {
 	/* address of the bounds directory */
 	void __user *bd_addr;
 #endif
+	u32 fake_mprotect_pkey;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff -puN kernel/fork.c~pkeys-95-rewire-mprotect-to-use-pkeys kernel/fork.c
--- a/kernel/fork.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.187874777 -0700
+++ b/kernel/fork.c	2015-10-01 15:21:25.228876618 -0700
@@ -927,6 +927,7 @@ static struct mm_struct *dup_mm(struct t
 
 	mm->hiwater_rss = get_mm_rss(mm);
 	mm->hiwater_vm = mm->total_vm;
+	mm->fake_mprotect_pkey = 0;
 
 	if (mm->binfmt && !try_module_get(mm->binfmt->module))
 		goto free_pt;
@@ -1700,7 +1701,7 @@ long _do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-
+	//printk("%s()\n", __func__);
 	/*
 	 * Determine whether and which event to report to ptracer.  When
 	 * called from kernel_thread or CLONE_UNTRACED is explicitly
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-95-rewire-mprotect-to-use-pkeys arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.197875226 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-10-01 15:21:25.228876618 -0700
@@ -41,6 +41,17 @@ u64 xfeatures_mask __read_mostly;
 static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask)*8];
+void hack_fpstate_for_pkru(struct xregs_state *xstate)
+{
+void *__pkru;
+xstate->header.xfeatures |= XFEATURE_MASK_PKRU;
+__pkru = ((char *)xstate) + xstate_offsets[XFEATURE_PKRU];
+	/*
+	 * Access disable PKEY 1 and
+	 * Write disable PKEY 2
+	 */
+*(u32 *)__pkru = 0x0024;
+}
 
 /*
  * Clear all of the X86_FEATURE_* bits that are unavailable
@@ -321,7 +332,10 @@ static void __init setup_init_fpu_buf(vo
 		init_fpstate.xsave.header.xcomp_bv = (u64)1 << 63 | xfeatures_mask;
 		init_fpstate.xsave.header.xfeatures = xfeatures_mask;
 	}
-
+	{
+		void hack_fpstate_for_pkru(struct xregs_state *xstate);
+		hack_fpstate_for_pkru(&init_fpstate.xsave);
+	}
 	/*
 	 * Init all the features state with header_bv being 0x0
 	 */
diff -puN arch/x86/mm/fault.c~pkeys-95-rewire-mprotect-to-use-pkeys arch/x86/mm/fault.c
--- 
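
The PKRU constant written by hack_fpstate_for_pkru() in the patch above can be checked by hand. This is a minimal sketch, assuming the documented PKRU layout of two bits per protection key (bit 2k is Access-Disable, bit 2k+1 is Write-Disable). The helper and macro names are illustrative and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Per-key rights bits, before shifting into the key's slot. */
#define PKRU_AD_BIT		0x1u	/* Access-Disable */
#define PKRU_WD_BIT		0x2u	/* Write-Disable */
#define PKRU_BITS_PER_KEY	2
#define NR_PKEYS		16

/* Shift a key's rights bits into that key's two-bit slot in PKRU. */
static uint32_t pkru_bits(int pkey, uint32_t rights)
{
	assert(pkey >= 0 && pkey < NR_PKEYS);
	assert((rights & ~(PKRU_AD_BIT | PKRU_WD_BIT)) == 0);
	return rights << (pkey * PKRU_BITS_PER_KEY);
}

/* "Access disable PKEY 1 and Write disable PKEY 2", as in the patch comment. */
static uint32_t hack_pkru_value(void)
{
	return pkru_bits(1, PKRU_AD_BIT) | pkru_bits(2, PKRU_WD_BIT);
}
```

Key 1 contributes bit 2 (0x04) and key 2 contributes bit 5 (0x20), which matches the 0x0024 stored by the hack.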

Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Dave Hansen
On 10/01/2015 01:39 PM, Kees Cook wrote:
> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>> * Dave Hansen  wrote:
>>>> If yes then this could be a significant security feature / usecase for pkeys:
> 
> Which CPUs (will) have pkeys?

It hasn't been announced publicly, so all I can say here is "future ones".

>>>> executable sections of shared libraries and binaries could be mapped with pkey
>>>> access disabled. If I read the Intel documentation correctly then that should
>>>> be possible.
>>>
>>> Agreed.  I've even heard from some researchers who are interested in this:
>>>
>>> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>>
>> So could we try to add an (opt-in) kernel option that enables this transparently
>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>> user-space changes and syscalls necessary?
> 
> I would like this very much. :)

I'll go hack something together and see what breaks.



Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Andy Lutomirski
On Thu, Oct 1, 2015 at 1:39 PM, Kees Cook  wrote:
> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>>
>> * Dave Hansen  wrote:
>>
>>> > If yes then this could be a significant security feature / usecase for pkeys:
>
> Which CPUs (will) have pkeys?
>
>>> > executable sections of shared libraries and binaries could be mapped with pkey
>>> > access disabled. If I read the Intel documentation correctly then that should
>>> > be possible.
>>>
>>> Agreed.  I've even heard from some researchers who are interested in this:
>>>
>>> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>>
>> So could we try to add an (opt-in) kernel option that enables this transparently
>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>> user-space changes and syscalls necessary?
>
> I would like this very much. :)
>
>> Beyond the security improvement, this would enable this hardware feature on most
>> x86 Linux distros automatically, on supported hardware, which is good for testing.
>>
>> Assuming it boots up fine on a typical distro, i.e. assuming that there are no
>> surprises where PROT_READ && PROT_EXEC sections are accessed as data.
>
> I can't wait to find out what implicitly expects PROT_READ from
> PROT_EXEC mappings. :)

There's one annoying issue at least:

mprotect_pkey(..., PROT_READ | PROT_EXEC, 0) sets protection key 0.
mprotect_pkey(..., PROT_EXEC, 0) maybe sets protection key 15 or
whatever we use for this.  What does mprotect_pkey(..., PROT_EXEC, 0)
do?  What if the caller actually wants key 0?  What if some CPU vendor
some day implements --x for real?


Also, how do we do mprotect_pkey and say "don't change the key"?
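
To make the ambiguity concrete, here is a minimal sketch of one possible key-resolution rule. Everything in it is an assumption for illustration: PKEY_KEEP, PKEY_EXEC_ONLY (using the "key 15 or whatever" value from above) and resolve_pkey() are hypothetical names and do not come from any posted patch.

```c
#include <assert.h>
#include <sys/mman.h>

/* Hypothetical values, for illustration only. */
#define PKEY_KEEP	(-1)	/* sentinel meaning "don't change the key" */
#define PKEY_EXEC_ONLY	15	/* key the kernel might reserve to emulate --x */

/*
 * Decide which key a mapping ends up with.  'cur' is the key the mapping
 * already has and 'req' is the key argument the caller passed.
 */
static int resolve_pkey(unsigned long prot, int req, int cur)
{
	assert(req == PKEY_KEEP || (req >= 0 && req < 16));

	if (req == PKEY_KEEP)
		return cur;
	/* Bare PROT_EXEC with key 0 is interpreted as "emulate --x". */
	if (prot == PROT_EXEC && req == 0)
		return PKEY_EXEC_ONLY;
	return req;
}
```

Under this rule a caller who passes PROT_EXEC and really does want key 0 cannot get it, which is exactly the collision described above. The sentinel covers the "don't change the key" case, at the cost of reserving one argument value.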

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Kees Cook
On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>
> * Dave Hansen  wrote:
>
>> > If yes then this could be a significant security feature / usecase for pkeys:

Which CPUs (will) have pkeys?

>> > executable sections of shared libraries and binaries could be mapped with pkey
>> > access disabled. If I read the Intel documentation correctly then that should
>> > be possible.
>>
>> Agreed.  I've even heard from some researchers who are interested in this:
>>
>> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>
> So could we try to add an (opt-in) kernel option that enables this transparently
> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
> user-space changes and syscalls necessary?

I would like this very much. :)

> Beyond the security improvement, this would enable this hardware feature on most
> x86 Linux distros automatically, on supported hardware, which is good for testing.
>
> Assuming it boots up fine on a typical distro, i.e. assuming that there are no
> surprises where PROT_READ && PROT_EXEC sections are accessed as data.

I can't wait to find out what implicitly expects PROT_READ from
PROT_EXEC mappings. :)

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Ingo Molnar

* Dave Hansen  wrote:

> > If yes then this could be a significant security feature / usecase for pkeys:
> > executable sections of shared libraries and binaries could be mapped with pkey
> > access disabled. If I read the Intel documentation correctly then that should
> > be possible.
> 
> Agreed.  I've even heard from some researchers who are interested in this:
> 
> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf

So could we try to add an (opt-in) kernel option that enables this transparently
and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
user-space changes and syscalls necessary?

Beyond the security improvement, this would enable this hardware feature on most
x86 Linux distros automatically, on supported hardware, which is good for testing.

Assuming it boots up fine on a typical distro, i.e. assuming that there are no
surprises where PROT_READ && PROT_EXEC sections are accessed as data.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Andy Lutomirski
On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen  wrote:
> On 10/01/2015 01:39 PM, Kees Cook wrote:
>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar  wrote:
>>> So could we try to add an (opt-in) kernel option that enables this transparently
>>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>>> user-space changes and syscalls necessary?
>>
>> I would like this very much. :)
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.

Somebody really ought to rework things so that a crash in init prints
out a normal indication of the unhandled signal and optionally leaves
everything else running.

Also...

EPT seems to have separate R, W, and X flags.  I wonder if it would
make sense to add a KVM paravirt feature that maps the entire guest
physical space an extra time at a monstrous offset with R cleared in
the EPT and passes through a #PF or other notification (KVM-specific
thing? #VE?) on a read fault.

This wouldn't even need a whole duplicate paging hierarchy -- it would
just duplicate the EPT PML4 entries, so it would add exactly zero
runtime memory usage.

The guest would use it by treating the high bit of the physical
address as a "may read" bit.

This reminds me -- we should probably wire up X86_TRAP_VE with a stub
that OOPSes until someone figures out some more useful thing to do.
We're probably not doing anyone any favors by unconditionally
promoting them to double-faults.

--Andy


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-10-01 Thread Linus Torvalds
On Thu, Oct 1, 2015 at 6:56 PM, Dave Hansen  wrote:
>
> Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
> also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
> but aren't _really_ going to read it, so we can safely deny them all
> access other than exec.

That's a completely insane assumption. There are tons of reasons to
have code and read-only data in the same segment, and it's very
traditional. Just assuming that you only execute out of something that
has PROT_EXEC | PROT_READ is insane.

No, what you *should* look at is to use the protection keys to
actually enforce a plain PROT_EXEC. That has never worked before
(because traditionally R implies X, and then we got NX).

That would at least allow people who know they don't intersperse
read-only constants in the code to use PROT_EXEC only.

Of course, there may well be users who use PROT_EXEC that actually *do*
do reads, and just relied on the old hardware behavior. So it's not
guaranteed to work either without any extra flags. But at least it's
worth a try, unlike the "yeah, the user asked for read, but the user
doesn't know what he's doing" thinking that is just crazy talk.
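
The difference between the two policies can be written as two predicates on the prot argument. This is a sketch only: hack_wants_pkey() mirrors the condition in the posted mprotect hack, exec_only_requested() is the plain PROT_EXEC test described above, and both function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Condition used by the posted hack: executable and not writable. */
static bool hack_wants_pkey(unsigned long prot)
{
	return (prot & PROT_EXEC) && !(prot & PROT_WRITE);
}

/* Plain PROT_EXEC: the caller asked for neither read nor write access. */
static bool exec_only_requested(unsigned long prot)
{
	return (prot & PROT_EXEC) && !(prot & (PROT_READ | PROT_WRITE));
}
```

For PROT_READ | PROT_EXEC the first predicate is true and the second is false, so only the first one takes away read access that the caller explicitly asked for.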

   Linus


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-09-25 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> This may mean that we want to have a way for binaries to indicate that they want
> their --x segments to be loaded with a particular protection key.  The right way
> to do that might be using an ELF note, and I also want to use ELF notes to allow
> turning off vsyscalls, so maybe it's time to write an ELF note parser in the
> kernel.

That would be absolutely lovely for many other reasons as well, and we should 
also 
add a tool to tools/ to edit/expand/shrink those ELF notes on existing systems.

I.e. make it really easy to augment security policies on an existing distro, 
using 
any filesystem (not just ACL capable ones) and using the binary only. Linux 
binaries could carry capabilities information, etc. etc.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-09-25 Thread Ingo Molnar

* Dave Hansen  wrote:

> > I.e. AFAICS pkeys could be used to create true '--x' permissions for
> > executable (user-space) pages.
>
> Just remember that all of the protections are dependent on the contents of
> PKRU.  If an attacker controls the Access-Disable bit in PKRU for the
> executable-only region, you're sunk.

The same is true if the attacker can execute mprotect() calls.

> But, that either requires being able to construct and execute arbitrary code
> *or* call existing code that sets PKRU to the desired values. Which, I guess,
> gets harder to do if all of the wrpkru's are *in* the execute-only area.

Exactly. True --x executable regions make it harder to 'upgrade' limited
attacks.

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-09-24 Thread Andy Lutomirski
On Thu, Sep 24, 2015 at 12:10 PM, Dave Hansen  wrote:
> On 09/24/2015 02:49 AM, Ingo Molnar wrote:
>> * Dave Hansen  wrote:
>>>> Another question, related to enumeration as well: I'm wondering whether
>>>> there's any way for the kernel to allocate a bit or two for its own
>>>> purposes - such as protecting crypto keys? Or is the facility
>>>> fundamentally intended for user-space use only?
>>>
>>> No, that's not possible with the current setup.
>>
>> Ok, then another question, have you considered the following usecase:
>>
>> AFAICS pkeys only affect data loads and stores. Instruction fetches are
>> notably absent from the documentation. Can you clarify that instructions
>> can be fetched and executed from PTE_READ but pkeys-all-access-disabled
>> pages?
>
> That is my understanding.  I don't have a test for it, but I'll go make one.
>
>> If yes then this could be a significant security feature / usecase for pkeys:
>> executable sections of shared libraries and binaries could be mapped with
>> pkey access disabled. If I read the Intel documentation correctly then that
>> should be possible.
>
> Agreed.  I've even heard from some researchers who are interested in this:
>
> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>
>> I.e. AFAICS pkeys could be used to create true '--x' permissions for
>> executable (user-space) pages.
>
> Just remember that all of the protections are dependent on the contents
> of PKRU.  If an attacker controls the Access-Disable bit in PKRU for the
> executable-only region, you're sunk.
>
> But, that either requires being able to construct and execute arbitrary
> code *or* call existing code that sets PKRU to the desired values.
> Which, I guess, gets harder to do if all of the wrpkru's are *in*
> the execute-only area.
>

This may mean that we want to have a way for binaries to indicate that
they want their --x segments to be loaded with a particular protection
key.  The right way to do that might be using an ELF note, and I also
want to use ELF notes to allow turning off vsyscalls, so maybe it's
time to write an ELF note parser in the kernel.
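
For a sense of what such a parser involves, here is a rough user-space
sketch of walking ELF note records. It assumes the usual layout (three
32-bit words n_namesz, n_descsz, n_type, then the name and the descriptor,
each padded to 4 bytes) and host byte order. struct note and next_note()
are made-up names, and no note type for protection keys or vsyscalls has
been defined anywhere in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One decoded note; the pointers refer into the caller's buffer. */
struct note {
	uint32_t type;
	const char *name;          /* namesz bytes, normally NUL-terminated */
	uint32_t namesz;
	const unsigned char *desc; /* descsz bytes of payload */
	uint32_t descsz;
};

static size_t align4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/*
 * Decode the note at buf[*off].  On success fill *out, advance *off past
 * the record and return 1.  Return 0 at end of buffer or when the record
 * is truncated.
 */
static int next_note(const unsigned char *buf, size_t len, size_t *off,
		     struct note *out)
{
	uint32_t hdr[3];	/* n_namesz, n_descsz, n_type */
	size_t pos = *off;

	if (pos > len || len - pos < sizeof(hdr))
		return 0;
	memcpy(hdr, buf + pos, sizeof(hdr));	/* host byte order assumed */
	pos += sizeof(hdr);

	/* The name and its padding must fit in what is left. */
	if (hdr[0] > len - pos || align4(hdr[0]) > len - pos)
		return 0;
	out->name = (const char *)(buf + pos);
	out->namesz = hdr[0];
	pos += align4(hdr[0]);

	/* The descriptor must fit; trailing padding may be cut off. */
	if (hdr[1] > len - pos)
		return 0;
	out->desc = buf + pos;
	out->descsz = hdr[1];
	pos += (align4(hdr[1]) > len - pos) ? len - pos : align4(hdr[1]);

	out->type = hdr[2];
	*off = pos;
	return 1;
}
```

A caller would loop on next_note() over the contents of a PT_NOTE segment
and match on name and type. An in-kernel version would additionally have to
treat every field as untrusted input from the binary.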

--Andy


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-09-24 Thread Dave Hansen
On 09/24/2015 02:49 AM, Ingo Molnar wrote:
> * Dave Hansen  wrote:
>>> Another question, related to enumeration as well: I'm wondering whether
>>> there's any way for the kernel to allocate a bit or two for its own
>>> purposes - such as protecting crypto keys? Or is the facility fundamentally
>>> intended for user-space use only?
>>
>> No, that's not possible with the current setup.
> 
> Ok, then another question, have you considered the following usecase:
> 
> AFAICS pkeys only affect data loads and stores. Instruction fetches are
> notably absent from the documentation. Can you clarify that instructions can
> be fetched and executed from PTE_READ but pkeys-all-access-disabled pages?

That is my understanding.  I don't have a test for it, but I'll go make one.
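
Until there is a real test, the expected behavior can be written down as a
small model. This is only a sketch of my reading of the documentation for
user-mode accesses, with made-up names (pkru_allows(), enum access_type).
It assumes the Access Disable bit for key i is PKRU bit 2*i and the Write
Disable bit is bit 2*i+1, and it ignores the ordinary page-table
permissions (present, writable, NX), which still apply on top.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum access_type { ACCESS_READ, ACCESS_WRITE, ACCESS_FETCH };

#define PKRU_AD_BIT(pkey) (1u << (2 * (pkey)))		/* Access Disable */
#define PKRU_WD_BIT(pkey) (1u << (2 * (pkey) + 1))	/* Write Disable */

/* Model only: does PKRU permit this access to a page tagged with pkey? */
static bool pkru_allows(uint32_t pkru, unsigned int pkey,
			enum access_type type)
{
	assert(pkey < 16);

	switch (type) {
	case ACCESS_FETCH:
		/* Instruction fetches are not checked against PKRU. */
		return true;
	case ACCESS_READ:
		return !(pkru & PKRU_AD_BIT(pkey));
	case ACCESS_WRITE:
		return !(pkru & (PKRU_AD_BIT(pkey) | PKRU_WD_BIT(pkey)));
	}
	return false;
}
```

In this model a page whose key has Access Disable set can still be executed
but not read, which is the '--x' case being asked about.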

> If yes then this could be a significant security feature / usecase for pkeys:
> executable sections of shared libraries and binaries could be mapped with
> pkey access disabled. If I read the Intel documentation correctly then that
> should be possible.

Agreed.  I've even heard from some researchers who are interested in this:

https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf

> I.e. AFAICS pkeys could be used to create true '--x' permissions for
> executable (user-space) pages.

Just remember that all of the protections are dependent on the contents
of PKRU.  If an attacker controls the Access-Disable bit in PKRU for the
executable-only region, you're sunk.

But, that either requires being able to construct and execute arbitrary
code *or* call existing code that sets PKRU to the desired values.
Which, I guess, gets harder to do if all of the wrpkru's are *in*
the execute-only area.



Re: [PATCH 26/26] x86, pkeys: Documentation

2015-09-24 Thread Ingo Molnar

* Dave Hansen  wrote:

> > Another question, related to enumeration as well: I'm wondering whether
> > there's any way for the kernel to allocate a bit or two for its own
> > purposes - such as protecting crypto keys? Or is the facility fundamentally
> > intended for user-space use only?
> 
> No, that's not possible with the current setup.

Ok, then another question, have you considered the following usecase:

AFAICS pkeys only affect data loads and stores. Instruction fetches are notably 
absent from the documentation. Can you clarify that instructions can be fetched 
and executed from PTE_READ but pkeys-all-access-disabled pages?

If yes then this could be a significant security feature / usecase for pkeys:
executable sections of shared libraries and binaries could be mapped with pkey
access disabled. If I read the Intel documentation correctly then that should
be possible.

The advantage of doing that is that it would block an existing attack method
for circumventing ASLR (or for scouting out an unknown binary): using an
existing (user-space) information leak to read the address space of a server
process, and using that to figure out the actual code present at a given
address.

The code signature can then be used to identify the precise layout of the
binary, and/or to create ROP gadgets - to escalate permissions using an
otherwise not exploitable buffer overflow.

I.e. AFAICS pkeys could be used to create true '--x' permissions for executable 
(user-space) pages.

But I might be reading it wrong ...

Thanks,

Ingo


Re: [PATCH 26/26] x86, pkeys: Documentation

2015-09-20 Thread Dave Hansen
On 09/20/2015 01:55 AM, Ingo Molnar wrote:
> * Dave Hansen  wrote:
>> +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
>> +which will be found on future Intel CPUs.
>> +
>> +Memory Protection Keys provides a mechanism for enforcing page-based
>> +protections, but without requiring modification of the page tables
>> +when an application changes protection domains.  It works by
>> +dedicating 4 previously ignored bits in each page table entry to a
>> +"protection key", giving 16 possible keys.
> 
> Wondering how user-space is supposed to discover the number of protection
> keys, is that CPUID leaf based, or hardcoded on the CPU feature bit?

The 16 keys are essentially hard-coded from the cpuid bit.
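
Roughly, in code: a sketch assuming the feature is enumerated in
CPUID.(EAX=7,ECX=0):ECX with bit 3 (PKU) meaning the CPU supports it and
bit 4 (OSPKE) meaning the OS has enabled it. The names below are made up
for illustration. The function takes the raw ECX value as an argument so it
does not depend on any particular way of executing CPUID.

```c
#include <stdbool.h>
#include <stdint.h>

#define PKEY_BITS	4			/* key bits per page table entry */
#define NR_PKEYS	(1u << PKEY_BITS)	/* 2^4 = 16 keys */

/* Feature bits in ECX of CPUID leaf 7, subleaf 0. */
#define CPUID7_ECX_PKU		(1u << 3)	/* CPU supports protection keys */
#define CPUID7_ECX_OSPKE	(1u << 4)	/* OS has enabled them (CR4.PKE) */

/* Number of usable keys implied by a raw ECX value from CPUID leaf 7. */
static unsigned int pkeys_available(uint32_t cpuid7_ecx)
{
	bool usable = (cpuid7_ecx & CPUID7_ECX_PKU) &&
		      (cpuid7_ecx & CPUID7_ECX_OSPKE);

	return usable ? NR_PKEYS : 0;
}
```

With GCC or Clang on x86 the ECX value could be obtained with
__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) from <cpuid.h>. There is no
separate field giving the key count; it follows from the 4 bits in the PTE.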

>> +There is also a new user-accessible register (PKRU) with two separate
>> +bits (Access Disable and Write Disable) for each key.  Being a CPU
>> +register, PKRU is inherently thread-local, potentially giving each
>> +thread a different set of protections from every other thread.
>> +
>> +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
>> +to the new register.  The feature is only available in 64-bit mode,
>> +even though there is theoretically space in the PAE PTEs.  These
>> +permissions are enforced on data access only and have no effect on
>> +instruction fetches.
> 
> Another question, related to enumeration as well: I'm wondering whether
> there's any way for the kernel to allocate a bit or two for its own purposes -
> such as protecting crypto keys? Or is the facility fundamentally intended for
> user-space use only?

No, that's not possible with the current setup.

Userspace has complete control over the contents of the PKRU register
with unprivileged instructions.  So the kernel cannot practically
protect any of its own data with this.
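
To illustrate how little stands between userspace and any PKRU value, here
is a sketch of the read-modify-write an application would do before
executing WRPKRU. It only computes the new register value; the instruction
itself is left out so the example also builds on machines without the
feature. The names (PKRU_AD, PKRU_WD, pkru_with_rights()) are made up, and
the layout assumed is two bits per key, Access Disable at bit 2*key and
Write Disable at bit 2*key+1.

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_AD	0x1u	/* Access Disable, relative to the key's bit pair */
#define PKRU_WD	0x2u	/* Write Disable, relative to the key's bit pair */

/* Return pkru with the two bits for pkey replaced by rights. */
static uint32_t pkru_with_rights(uint32_t pkru, unsigned int pkey,
				 uint32_t rights)
{
	assert(pkey < 16);
	assert((rights & ~(PKRU_AD | PKRU_WD)) == 0);

	unsigned int shift = 2 * pkey;

	pkru &= ~(0x3u << shift);	/* drop the old rights for this key */
	pkru |= rights << shift;	/* install the new ones */
	return pkru;
}
```

Passing rights of 0 for a key re-enables all access to it, and nothing
privileged is involved in doing so.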

> Similarly, the pmem (persistent memory) driver could employ protection keys
> to keep terabytes of data 'masked out' most of the time - protecting data
> from kernel space memory corruption bugs.

I wish we could do this, but we cannot with the current implementation.



Re: [PATCH 26/26] x86, pkeys: Documentation

2015-09-20 Thread Ingo Molnar

* Dave Hansen  wrote:

> +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
> +which will be found on future Intel CPUs.
> +
> +Memory Protection Keys provides a mechanism for enforcing page-based
> +protections, but without requiring modification of the page tables
> +when an application changes protection domains.  It works by
> +dedicating 4 previously ignored bits in each page table entry to a
> +"protection key", giving 16 possible keys.

Wondering how user-space is supposed to discover the number of protection keys,
is that CPUID leaf based, or hardcoded on the CPU feature bit?

> +There is also a new user-accessible register (PKRU) with two separate
> +bits (Access Disable and Write Disable) for each key.  Being a CPU
> +register, PKRU is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.
> +
> +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
> +to the new register.  The feature is only available in 64-bit mode,
> +even though there is theoretically space in the PAE PTEs.  These
> +permissions are enforced on data access only and have no effect on
> +instruction fetches.

Another question, related to enumeration as well: I'm wondering whether there's
any way for the kernel to allocate a bit or two for its own purposes - such as
protecting crypto keys? Or is the facility fundamentally intended for
user-space use only?

Just a quick example: let's assume the kernel has an information leak hole, a
way to read any kernel address and pass that to the kernel attacker. Let's
also assume that the main crypto-keys of the kernel are protected by
protection-keys. The code exposing the information leak will very likely have
protection-key protected areas masked out, so the scope of the information
leak is mitigated to a certain degree, the crypto keys are not readable.

Similarly, the pmem (persistent memory) driver could employ protection keys to
keep terabytes of data 'masked out' most of the time - protecting data from
kernel space memory corruption bugs.

Thanks,

Ingo

