Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-04 Thread Rik van Riel
On Tue, 2016-10-04 at 08:35 +0200, Ingo Molnar wrote:
> * Andy Lutomirski  wrote:
> 
> > > We can get rid of fpu.counter, since nobody uses it
> > > any more.
> > 
> > We should definitely do this.
> > 
> > Maybe getting in some cleanups first (my lazy fpu deletion,
> > fpu.counter removal, etc) first is the way to go.
> 
> Could you guys please submit all pending FPU bits that aren't new
> functionality to -tip? I'll remove lazy FPU handling if you don't
> beat me to it.
> 
> Plus after lazy FPU removal we should take another good look and
> make the FPU state machine even more obvious and self-documenting.
> 
> Upstream merge commit 597f03f9d133 would be a good base.

I am flying out to Europe tomorrow, but will try to get
you the cleanups some time this week.  Besides the first
four patcehs from Andy, there is a shortlist of a few
more small changes.

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-04 Thread Rik van Riel
On Tue, 2016-10-04 at 08:35 +0200, Ingo Molnar wrote:
> * Andy Lutomirski  wrote:
> 
> > > We can get rid of fpu.counter, since nobody uses it
> > > any more.
> > 
> > We should definitely do this.
> > 
> > Maybe getting in some cleanups first (my lazy fpu deletion,
> > fpu.counter removal, etc) first is the way to go.
> 
> Could you guys please submit all pending FPU bits that aren't new
> functionality to -tip? I'll remove lazy FPU handling if you don't
> beat me to it.
> 
> Plus after lazy FPU removal we should take another good look and
> make the FPU state machine even more obvious and self-documenting.
> 
> Upstream merge commit 597f03f9d133 would be a good base.

I am flying out to Europe tomorrow, but will try to get
you the cleanups some time this week.  Besides the first
four patcehs from Andy, there is a shortlist of a few
more small changes.

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-04 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > Running the task in user space, would have to ensure
> > "registers valid" is true, and make "memory valid"
> > false, because userspace could write to the registers.
> >
> > Doing a ptrace fpstate_read would make "memory valid"
> > true, but does not need to invalidate "registers valid".
> >
> > Doing a ptrace fpstate_write would make "memory valid"
> > true, but would invalidate "registers valid".
> >
> > Context switching away from a task would make "memory
> > valid" true, by virtue of copying the fpstate to
> > memory.
> >
> > Ingo, would you have any objection to patches tracking
> > the validity of both register and memory states
> > independently?

I'd like to reserve judgement - but tentatively I see no red flags
at the moment, but in any case I'd like to start with:

> >
> > We can get rid of fpu.counter, since nobody uses it
> > any more.
> 
> We should definitely do this.
> 
> Maybe getting in some cleanups first (my lazy fpu deletion,
> fpu.counter removal, etc) first is the way to go.

Could you guys please submit all pending FPU bits that aren't new
functionality to -tip? I'll remove lazy FPU handling if you don't
beat me to it.

Plus after lazy FPU removal we should take another good look and
make the FPU state machine even more obvious and self-documenting.

Upstream merge commit 597f03f9d133 would be a good base.

Thanks,

Ingo


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-04 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > Running the task in user space, would have to ensure
> > "registers valid" is true, and make "memory valid"
> > false, because userspace could write to the registers.
> >
> > Doing a ptrace fpstate_read would make "memory valid"
> > true, but does not need to invalidate "registers valid".
> >
> > Doing a ptrace fpstate_write would make "memory valid"
> > true, but would invalidate "registers valid".
> >
> > Context switching away from a task would make "memory
> > valid" true, by virtue of copying the fpstate to
> > memory.
> >
> > Ingo, would you have any objection to patches tracking
> > the validity of both register and memory states
> > independently?

I'd like to reserve judgement - but tentatively I see no red flags
at the moment, but in any case I'd like to start with:

> >
> > We can get rid of fpu.counter, since nobody uses it
> > any more.
> 
> We should definitely do this.
> 
> Maybe getting in some cleanups first (my lazy fpu deletion,
> fpu.counter removal, etc) first is the way to go.

Could you guys please submit all pending FPU bits that aren't new
functionality to -tip? I'll remove lazy FPU handling if you don't
beat me to it.

Plus after lazy FPU removal we should take another good look and
make the FPU state machine even more obvious and self-documenting.

Upstream merge commit 597f03f9d133 would be a good base.

Thanks,

Ingo


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Mon, Oct 3, 2016 at 7:47 PM, Rik van Riel  wrote:
> On Mon, 2016-10-03 at 19:09 -0700, Andy Lutomirski wrote:
>
>> > Having two separate status booleans for "registers valid"
>> > and "memory valid" may make more sense.
>>
>> I have no problem with the concept of "owner_ctx", and I think it's a
>> perfectly reasonable data structure.  My problem with it is that it's
>> subtle and knowledge of it is spread all over the place.  Just going
>> with "registers valid" in a variable won't work, I think, because
>> there's nowhere to put it.  We need to be able to delete a struct fpu
>> while that struct fpu might have a valid copy in a different cpu's
>> registers.
>>
>> Anyway, feel free to tell me that I'm making this too difficult :)
>
> How about we rename fpu_want_lazy_restore to
> fpu_registers_valid()?  Problem solved :)
>
> Then we can rename __cpu_disable_lazy_restore
> to fpu_invalidate_registers(), and call that
> before we modify any in-memory FPU state.

Sounds good to me.

>
>> > We can get rid of fpu.counter, since nobody uses it
>> > any more.
>>
>> We should definitely do this.
>>
>> Maybe getting in some cleanups first (my lazy fpu deletion,
>> fpu.counter removal, etc) first is the way to go.
>
> Sounds good.  I will keep my patch 1/4 as part of the
> cleanup series, and will not move on to the harder
> stuff until after the cleanups.
>
> Any other stuff I should clean up while we're there?

Almost certainly, but nothing I'm thinking of right now :)

>
>> > > > >
>> > You are right, read_pkru() and write_pkru() can only deal with
>> > the pkru state being present in registers. Is this because of an
>> > assumption in the code, or because of a hardware requirement?
>
> read_pkru and write_pkru would be candidates for using
> fpu_registers_valid, and potentially a fpu_make_registers_valid,
> which restores the contents of the fpu registers from memory,
> if fpu_registers_valid is not true.
>
> Likewise, we can have an fpu_make_memory_valid to ensure the
> in kernel memory copy of the FPU registers is valid, potentially
> a _read and _write version that do exactly what the pstate code
> wants today.
>
> Would that make sense as an API?
>
> --
> All Rights Reversed.



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Mon, Oct 3, 2016 at 7:47 PM, Rik van Riel  wrote:
> On Mon, 2016-10-03 at 19:09 -0700, Andy Lutomirski wrote:
>
>> > Having two separate status booleans for "registers valid"
>> > and "memory valid" may make more sense.
>>
>> I have no problem with the concept of "owner_ctx", and I think it's a
>> perfectly reasonable data structure.  My problem with it is that it's
>> subtle and knowledge of it is spread all over the place.  Just going
>> with "registers valid" in a variable won't work, I think, because
>> there's nowhere to put it.  We need to be able to delete a struct fpu
>> while that struct fpu might have a valid copy in a different cpu's
>> registers.
>>
>> Anyway, feel free to tell me that I'm making this too difficult :)
>
> How about we rename fpu_want_lazy_restore to
> fpu_registers_valid()?  Problem solved :)
>
> Then we can rename __cpu_disable_lazy_restore
> to fpu_invalidate_registers(), and call that
> before we modify any in-memory FPU state.

Sounds good to me.

>
>> > We can get rid of fpu.counter, since nobody uses it
>> > any more.
>>
>> We should definitely do this.
>>
>> Maybe getting in some cleanups first (my lazy fpu deletion,
>> fpu.counter removal, etc) first is the way to go.
>
> Sounds good.  I will keep my patch 1/4 as part of the
> cleanup series, and will not move on to the harder
> stuff until after the cleanups.
>
> Any other stuff I should clean up while we're there?

Almost certainly, but nothing I'm thinking of right now :)

>
>> > > > >
>> > You are right, read_pkru() and write_pkru() can only deal with
>> > the pkru state being present in registers. Is this because of an
>> > assumption in the code, or because of a hardware requirement?
>
> read_pkru and write_pkru would be candidates for using
> fpu_registers_valid, and potentially a fpu_make_registers_valid,
> which restores the contents of the fpu registers from memory,
> if fpu_registers_valid is not true.
>
> Likewise, we can have an fpu_make_memory_valid to ensure the
> in kernel memory copy of the FPU registers is valid, potentially
> a _read and _write version that do exactly what the pstate code
> wants today.
>
> Would that make sense as an API?
>
> --
> All Rights Reversed.



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Oct 3, 2016 7:11 PM, "Rik van Riel"  wrote:
>
> On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> >
> > Anything else that tries to read task xstate from memory, i.e. MPX
> > and
> > PKRU.  (Although if we switch to eager-switched PKRU, then PKRU stops
> > mattering for this purpose.)
> >
> > Actually, I don't see any way your patches can be compatible with
> > PKRU
> > without switching to eager-switched PKRU.
>
> There is one case where the in-register PKRU state matters:
> - user space accesses to memory
>
> There are several cases where the in-memory PKRU state would
> suffice:
> - get_user_pages(_fast) to the local task (could also use registers)
> - setting VMA/PTE permission bits (could also use registers)
>
> There is one case where only in-memory PKRU state works, where
> PKRU is currently simply ignored:
> - get_user_pages to another task's memory

Also __get_user, etc.  I don't think you want to start playing with
TIF_LOAD_FPU there.  I think tracking PKRU separately and eagerly
loading it with WRPKRU may be the only decent choice here.

>
> Dave, are there major obstacles to making read_pkru and write_pkru
> work with in-memory state?
>
> Would it be better for read/write_pkru to force the FPU state
> to get loaded into registers, under the assumption that if things
> like get_user_pages_fast happens, we will likely switch back to
> userspace soon, anyway?
>
> Would that assumption be wrong with KVM? :)

>
> --
> All Rights Reversed.


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Oct 3, 2016 7:11 PM, "Rik van Riel"  wrote:
>
> On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> >
> > Anything else that tries to read task xstate from memory, i.e. MPX
> > and
> > PKRU.  (Although if we switch to eager-switched PKRU, then PKRU stops
> > mattering for this purpose.)
> >
> > Actually, I don't see any way your patches can be compatible with
> > PKRU
> > without switching to eager-switched PKRU.
>
> There is one case where the in-register PKRU state matters:
> - user space accesses to memory
>
> There are several cases where the in-memory PKRU state would
> suffice:
> - get_user_pages(_fast) to the local task (could also use registers)
> - setting VMA/PTE permission bits (could also use registers)
>
> There is one case where only in-memory PKRU state works, where
> PKRU is currently simply ignored:
> - get_user_pages to another task's memory

Also __get_user, etc.  I don't think you want to start playing with
TIF_LOAD_FPU there.  I think tracking PKRU separately and eagerly
loading it with WRPKRU may be the only decent choice here.

>
> Dave, are there major obstacles to making read_pkru and write_pkru
> work with in-memory state?
>
> Would it be better for read/write_pkru to force the FPU state
> to get loaded into registers, under the assumption that if things
> like get_user_pages_fast happens, we will likely switch back to
> userspace soon, anyway?
>
> Would that assumption be wrong with KVM? :)

>
> --
> All Rights Reversed.


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 19:09 -0700, Andy Lutomirski wrote:

> > Having two separate status booleans for "registers valid"
> > and "memory valid" may make more sense.
> 
> I have no problem with the concept of "owner_ctx", and I think it's a
> perfectly reasonable data structure.  My problem with it is that it's
> subtle and knowledge of it is spread all over the place.  Just going
> with "registers valid" in a variable won't work, I think, because
> there's nowhere to put it.  We need to be able to delete a struct fpu
> while that struct fpu might have a valid copy in a different cpu's
> registers.
> 
> Anyway, feel free to tell me that I'm making this too difficult :)

How about we rename fpu_want_lazy_restore to
fpu_registers_valid()?  Problem solved :)

Then we can rename __cpu_disable_lazy_restore
to fpu_invalidate_registers(), and call that
before we modify any in-memory FPU state.

> > We can get rid of fpu.counter, since nobody uses it
> > any more.
> 
> We should definitely do this.
> 
> Maybe getting in some cleanups first (my lazy fpu deletion,
> fpu.counter removal, etc) first is the way to go.

Sounds good.  I will keep my patch 1/4 as part of the
cleanup series, and will not move on to the harder
stuff until after the cleanups.

Any other stuff I should clean up while we're there?

> > > > > 
> > You are right, read_pkru() and write_pkru() can only deal with
> > the pkru state being present in registers. Is this because of an
> > assumption in the code, or because of a hardware requirement?

read_pkru and write_pkru would be candidates for using
fpu_registers_valid, and potentially a fpu_make_registers_valid,
which restores the contents of the fpu registers from memory,
if fpu_registers_valid is not true.

Likewise, we can have an fpu_make_memory_valid to ensure the
in kernel memory copy of the FPU registers is valid, potentially
a _read and _write version that do exactly what the pstate code
wants today.

Would that make sense as an API?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 19:09 -0700, Andy Lutomirski wrote:

> > Having two separate status booleans for "registers valid"
> > and "memory valid" may make more sense.
> 
> I have no problem with the concept of "owner_ctx", and I think it's a
> perfectly reasonable data structure.  My problem with it is that it's
> subtle and knowledge of it is spread all over the place.  Just going
> with "registers valid" in a variable won't work, I think, because
> there's nowhere to put it.  We need to be able to delete a struct fpu
> while that struct fpu might have a valid copy in a different cpu's
> registers.
> 
> Anyway, feel free to tell me that I'm making this too difficult :)

How about we rename fpu_want_lazy_restore to
fpu_registers_valid()?  Problem solved :)

Then we can rename __cpu_disable_lazy_restore
to fpu_invalidate_registers(), and call that
before we modify any in-memory FPU state.

> > We can get rid of fpu.counter, since nobody uses it
> > any more.
> 
> We should definitely do this.
> 
> Maybe getting in some cleanups first (my lazy fpu deletion,
> fpu.counter removal, etc) first is the way to go.

Sounds good.  I will keep my patch 1/4 as part of the
cleanup series, and will not move on to the harder
stuff until after the cleanups.

Any other stuff I should clean up while we're there?

> > > > > 
> > You are right, read_pkru() and write_pkru() can only deal with
> > the pkru state being present in registers. Is this because of an
> > assumption in the code, or because of a hardware requirement?

read_pkru and write_pkru would be candidates for using
fpu_registers_valid, and potentially a fpu_make_registers_valid,
which restores the contents of the fpu registers from memory,
if fpu_registers_valid is not true.

Likewise, we can have an fpu_make_memory_valid to ensure the
in kernel memory copy of the FPU registers is valid, potentially
a _read and _write version that do exactly what the pstate code
wants today.

Would that make sense as an API?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> 
> Anything else that tries to read task xstate from memory, i.e. MPX
> and
> PKRU.  (Although if we switch to eager-switched PKRU, then PKRU stops
> mattering for this purpose.)
> 
> Actually, I don't see any way your patches can be compatible with
> PKRU
> without switching to eager-switched PKRU.

There is one case where the in-register PKRU state matters:
- user space accesses to memory

There are several cases where the in-memory PKRU state would
suffice:
- get_user_pages(_fast) to the local task (could also use registers)
- setting VMA/PTE permission bits (could also use registers)

There is one case where only in-memory PKRU state works, where
PKRU is currently simply ignored:
- get_user_pages to another task's memory

Dave, are there major obstacles to making read_pkru and write_pkru
work with in-memory state?

Would it be better for read/write_pkru to force the FPU state
to get loaded into registers, under the assumption that if things
like get_user_pages_fast happens, we will likely switch back to
userspace soon, anyway?

Would that assumption be wrong with KVM? :)

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> 
> Anything else that tries to read task xstate from memory, i.e. MPX
> and
> PKRU.  (Although if we switch to eager-switched PKRU, then PKRU stops
> mattering for this purpose.)
> 
> Actually, I don't see any way your patches can be compatible with
> PKRU
> without switching to eager-switched PKRU.

There is one case where the in-register PKRU state matters:
- user space accesses to memory

There are several cases where the in-memory PKRU state would
suffice:
- get_user_pages(_fast) to the local task (could also use registers)
- setting VMA/PTE permission bits (could also use registers)

There is one case where only in-memory PKRU state works, where
PKRU is currently simply ignored:
- get_user_pages to another task's memory

Dave, are there major obstacles to making read_pkru and write_pkru
work with in-memory state?

Would it be better for read/write_pkru to force the FPU state
to get loaded into registers, under the assumption that if things
like get_user_pages_fast happens, we will likely switch back to
userspace soon, anyway?

Would that assumption be wrong with KVM? :)

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Mon, Oct 3, 2016 at 6:29 PM, Rik van Riel  wrote:
> On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
>> On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel  wrote:
>> >
>> > >
>> > > One of my objections to the current code structure is that I find
>> > > it
>> > > quite hard to keep track of all of the possible states of the fpu
>> > > that
>> > > we have.  Off the top of my head, we have:
>> > >
>> > > 1. In-cpu state is valid and in-memory state is invalid.  (This
>> > > is
>> > > the
>> > > case when we're running user code, for example.)
>> > > 2. In-memory state is valid.
>> > > 2a. In-cpu state is *also* valid for some particular task.  In
>> > > this
>> > > state, no one may write to either FPU state copy for that task
>> > > lest
>> > > they get out of sync.
>> > > 2b. In-cpu state is not valid.
>> >
>> > It gets more fun with ptrace. Look at xfpregs_get
>> > and xfpregs_set, which call fpu__activate_fpstate_read
>> > and fpu__activate_fpstate_write respectively.
>> >
>> > It looks like those functions already deal with the
>> > cases you outline above.
>>
>> Hmm, maybe they could be used.  The names are IMO atrocious.  I had
>> to
>> read the docs and the code to figure out WTF "activating" the
>> "fpstate" even meant.
>
> If you have good naming ideas, I'd be willing to submit
> patches to rename them :)
>
>> > >
>> > > Rik, your patches further complicate this by making it possible
>> > > to
>> > > have the in-cpu state be valid for a non-current task even when
>> > > non-idle.
>> >
>> > However, we never need to examine the in-cpu state for
>> > a task that is not currently running, or being scheduled
>> > to in the context switch code.
>> >
>> > We always save the FPU register state when a task is
>> > context switched out, so for a non-current task, we
>> > still only need to consider whether the in-memory state
>> > is valid or not.
>> >
>> > The in-cpu state does not matter.
>> >
>> > fpu__activate_fpstate_write will disable lazy restore,
>> > and ensure a full restore from memory.
>>
>> Something has to fix up owner_ctx so that the previous thread doesn't
>> think its CPU state is still valid.  I don't think
>> activate_fpstate_write does it, because nothing should (I think) be
>> calling it as a mattor of course on context switch.
>
> These functions are called on tasks that are ptraced,
> and currently stopped. fpu__activate_fpstate_write
> sets fpu->last_cpu to -1, which will cause the next
> context switch to the task to load the fpstate into
> the registers.
>
> The status in the struct fpu keeps track of the status
> of things, though maybe not as explicitly as you would
> want.
>
> Having two separate status booleans for "registers valid"
> and "memory valid" may make more sense.

I have no problem with the concept of "owner_ctx", and I think it's a
perfectly reasonable data structure.  My problem with it is that it's
subtle and knowledge of it is spread all over the place.  Just going
with "registers valid" in a variable won't work, I think, because
there's nowhere to put it.  We need to be able to delete a struct fpu
while that struct fpu might have a valid copy in a different cpu's
registers.

Anyway, feel free to tell me that I'm making this too difficult :)

>
> Running the task in user space, would have to ensure
> "registers valid" is true, and make "memory valid"
> false, because userspace could write to the registers.
>
> Doing a ptrace fpstate_read would make "memory valid"
> true, but does not need to invalidate "registers valid".
>
> Doing a ptrace fpstate_write would make "memory valid"
> true, but would invalidate "registers valid".
>
> Context switching away from a task would make "memory
> valid" true, by virtue of copying the fpstate to
> memory.
>
> Ingo, would you have any objection to patches tracking
> the validity of both register and memory states
> independently?
>
> We can get rid of fpu.counter, since nobody uses it
> any more.

We should definitely do this.

Maybe getting in some cleanups first (my lazy fpu deletion,
fpu.counter removal, etc) first is the way to go.

>
>> > >  This is why I'm thinking that we should maybe revisit the
>> > > API a bit to make this clearer and to avoid scattering around all
>> > > the
>> > > state manipulation quite so much.  I propose, as a straw-man:
>> > >
>> > > user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
>> > > current.
>> > > After calling this, we're in state 1 or 2a.
>> > >
>> > > user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
>> > > writable
>> > > for current.  Invalidates any in-memory copy.  After calling
>> > > this,
>> > > we're in state 1.
>> >
>> > How can we distinguish between these two?
>> >
>> > If the task is running, it can change its FPU
>> > registers at any time, without any obvious thing
>> > to mark the transition from state 2a to state 1.
>>
>> That's true now, but I don't think it's really true any more with
>> your
>> 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Mon, Oct 3, 2016 at 6:29 PM, Rik van Riel  wrote:
> On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
>> On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel  wrote:
>> >
>> > >
>> > > One of my objections to the current code structure is that I find
>> > > it
>> > > quite hard to keep track of all of the possible states of the fpu
>> > > that
>> > > we have.  Off the top of my head, we have:
>> > >
>> > > 1. In-cpu state is valid and in-memory state is invalid.  (This
>> > > is
>> > > the
>> > > case when we're running user code, for example.)
>> > > 2. In-memory state is valid.
>> > > 2a. In-cpu state is *also* valid for some particular task.  In
>> > > this
>> > > state, no one may write to either FPU state copy for that task
>> > > lest
>> > > they get out of sync.
>> > > 2b. In-cpu state is not valid.
>> >
>> > It gets more fun with ptrace. Look at xfpregs_get
>> > and xfpregs_set, which call fpu__activate_fpstate_read
>> > and fpu__activate_fpstate_write respectively.
>> >
>> > It looks like those functions already deal with the
>> > cases you outline above.
>>
>> Hmm, maybe they could be used.  The names are IMO atrocious.  I had
>> to
>> read the docs and the code to figure out WTF "activating" the
>> "fpstate" even meant.
>
> If you have good naming ideas, I'd be willing to submit
> patches to rename them :)
>
>> > >
>> > > Rik, your patches further complicate this by making it possible
>> > > to
>> > > have the in-cpu state be valid for a non-current task even when
>> > > non-idle.
>> >
>> > However, we never need to examine the in-cpu state for
>> > a task that is not currently running, or being scheduled
>> > to in the context switch code.
>> >
>> > We always save the FPU register state when a task is
>> > context switched out, so for a non-current task, we
>> > still only need to consider whether the in-memory state
>> > is valid or not.
>> >
>> > The in-cpu state does not matter.
>> >
>> > fpu__activate_fpstate_write will disable lazy restore,
>> > and ensure a full restore from memory.
>>
>> Something has to fix up owner_ctx so that the previous thread doesn't
>> think its CPU state is still valid.  I don't think
>> activate_fpstate_write does it, because nothing should (I think) be
>> calling it as a mattor of course on context switch.
>
> These functions are called on tasks that are ptraced,
> and currently stopped. fpu__activate_fpstate_write
> sets fpu->last_cpu to -1, which will cause the next
> context switch to the task to load the fpstate into
> the registers.
>
> The status in the struct fpu keeps track of the status
> of things, though maybe not as explicitly as you would
> want.
>
> Having two separate status booleans for "registers valid"
> and "memory valid" may make more sense.

I have no problem with the concept of "owner_ctx", and I think it's a
perfectly reasonable data structure.  My problem with it is that it's
subtle and knowledge of it is spread all over the place.  Just going
with "registers valid" in a variable won't work, I think, because
there's nowhere to put it.  We need to be able to delete a struct fpu
while that struct fpu might have a valid copy in a different cpu's
registers.

Anyway, feel free to tell me that I'm making this too difficult :)

>
> Running the task in user space, would have to ensure
> "registers valid" is true, and make "memory valid"
> false, because userspace could write to the registers.
>
> Doing a ptrace fpstate_read would make "memory valid"
> true, but does not need to invalidate "registers valid".
>
> Doing a ptrace fpstate_write would make "memory valid"
> true, but would invalidate "registers valid".
>
> Context switching away from a task would make "memory
> valid" true, by virtue of copying the fpstate to
> memory.
>
> Ingo, would you have any objection to patches tracking
> the validity of both register and memory states
> independently?
>
> We can get rid of fpu.counter, since nobody uses it
> any more.

We should definitely do this.

Maybe getting in some cleanups first (my lazy fpu deletion,
fpu.counter removal, etc) first is the way to go.

>
>> > >  This is why I'm thinking that we should maybe revisit the
>> > > API a bit to make this clearer and to avoid scattering around all
>> > > the
>> > > state manipulation quite so much.  I propose, as a straw-man:
>> > >
>> > > user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
>> > > current.
>> > > After calling this, we're in state 1 or 2a.
>> > >
>> > > user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
>> > > writable
>> > > for current.  Invalidates any in-memory copy.  After calling
>> > > this,
>> > > we're in state 1.
>> >
>> > How can we distinguish between these two?
>> >
>> > If the task is running, it can change its FPU
>> > registers at any time, without any obvious thing
>> > to mark the transition from state 2a to state 1.
>>
>> That's true now, but I don't think it's really true any more with
>> your
>> patches.  If a task is running in 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel  wrote:
> > 
> > > 
> > > One of my objections to the current code structure is that I find
> > > it
> > > quite hard to keep track of all of the possible states of the fpu
> > > that
> > > we have.  Off the top of my head, we have:
> > > 
> > > 1. In-cpu state is valid and in-memory state is invalid.  (This
> > > is
> > > the
> > > case when we're running user code, for example.)
> > > 2. In-memory state is valid.
> > > 2a. In-cpu state is *also* valid for some particular task.  In
> > > this
> > > state, no one may write to either FPU state copy for that task
> > > lest
> > > they get out of sync.
> > > 2b. In-cpu state is not valid.
> > 
> > It gets more fun with ptrace. Look at xfpregs_get
> > and xfpregs_set, which call fpu__activate_fpstate_read
> > and fpu__activate_fpstate_write respectively.
> > 
> > It looks like those functions already deal with the
> > cases you outline above.
> 
> Hmm, maybe they could be used.  The names are IMO atrocious.  I had
> to
> read the docs and the code to figure out WTF "activating" the
> "fpstate" even meant.

If you have good naming ideas, I'd be willing to submit
patches to rename them :)

> > > 
> > > Rik, your patches further complicate this by making it possible
> > > to
> > > have the in-cpu state be valid for a non-current task even when
> > > non-idle.
> > 
> > However, we never need to examine the in-cpu state for
> > a task that is not currently running, or being scheduled
> > to in the context switch code.
> > 
> > We always save the FPU register state when a task is
> > context switched out, so for a non-current task, we
> > still only need to consider whether the in-memory state
> > is valid or not.
> > 
> > The in-cpu state does not matter.
> > 
> > fpu__activate_fpstate_write will disable lazy restore,
> > and ensure a full restore from memory.
> 
> Something has to fix up owner_ctx so that the previous thread doesn't
> think its CPU state is still valid.  I don't think
> activate_fpstate_write does it, because nothing should (I think) be
> calling it as a mattor of course on context switch.

These functions are called on tasks that are ptraced,
and currently stopped. fpu__activate_fpstate_write
sets fpu->last_cpu to -1, which will cause the next
context switch to the task to load the fpstate into
the registers.

The status in the struct fpu keeps track of the status
of things, though maybe not as explicitly as you would
want.

Having two separate status booleans for "registers valid"
and "memory valid" may make more sense.

Running the task in user space, would have to ensure
"registers valid" is true, and make "memory valid"
false, because userspace could write to the registers.

Doing a ptrace fpstate_read would make "memory valid"
true, but does not need to invalidate "registers valid".

Doing a ptrace fpstate_write would make "memory valid"
true, but would invalidate "registers valid".

Context switching away from a task would make "memory
valid" true, by virtue of copying the fpstate to
memory.

Ingo, would you have any objection to patches tracking
the validity of both register and memory states
independently?

We can get rid of fpu.counter, since nobody uses it
any more.

> > >  This is why I'm thinking that we should maybe revisit the
> > > API a bit to make this clearer and to avoid scattering around all
> > > the
> > > state manipulation quite so much.  I propose, as a straw-man:
> > > 
> > > user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
> > > current.
> > > After calling this, we're in state 1 or 2a.
> > > 
> > > user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
> > > writable
> > > for current.  Invalidates any in-memory copy.  After calling
> > > this,
> > > we're in state 1.
> > 
> > How can we distinguish between these two?
> > 
> > If the task is running, it can change its FPU
> > registers at any time, without any obvious thing
> > to mark the transition from state 2a to state 1.
> 
> That's true now, but I don't think it's really true any more with
> your
> patches.  If a task is running in *kernel* mode, then I think that
> pretty much all of the combinations are possible.  When the task
> actually enters user mode, we have to make sure we're in state 1 so
> the CPU regs belong to the task and are writable.

That is fair.

> > It sounds like the first function you name would
> > only be useful if we prevented a switch to user
> > space, which could immediately do whatever it wants
> > with the registers.
> > 
> > Is there a use case for user_fpregs_copy_to_cpu()?
> 
> With PKRU, any user memory access would need
> user_fpregs_copy_to_cpu().  I think we should get rid of this and
> just
> switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
> context switch to avoid this.  Other than that, I can't think of any
> use cases off the top of my head unless the signal code 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel  wrote:
> > 
> > > 
> > > One of my objections to the current code structure is that I find
> > > it
> > > quite hard to keep track of all of the possible states of the fpu
> > > that
> > > we have.  Off the top of my head, we have:
> > > 
> > > 1. In-cpu state is valid and in-memory state is invalid.  (This
> > > is
> > > the
> > > case when we're running user code, for example.)
> > > 2. In-memory state is valid.
> > > 2a. In-cpu state is *also* valid for some particular task.  In
> > > this
> > > state, no one may write to either FPU state copy for that task
> > > lest
> > > they get out of sync.
> > > 2b. In-cpu state is not valid.
> > 
> > It gets more fun with ptrace. Look at xfpregs_get
> > and xfpregs_set, which call fpu__activate_fpstate_read
> > and fpu__activate_fpstate_write respectively.
> > 
> > It looks like those functions already deal with the
> > cases you outline above.
> 
> Hmm, maybe they could be used.  The names are IMO atrocious.  I had
> to
> read the docs and the code to figure out WTF "activating" the
> "fpstate" even meant.

If you have good naming ideas, I'd be willing to submit
patches to rename them :)

> > > 
> > > Rik, your patches further complicate this by making it possible
> > > to
> > > have the in-cpu state be valid for a non-current task even when
> > > non-idle.
> > 
> > However, we never need to examine the in-cpu state for
> > a task that is not currently running, or being scheduled
> > to in the context switch code.
> > 
> > We always save the FPU register state when a task is
> > context switched out, so for a non-current task, we
> > still only need to consider whether the in-memory state
> > is valid or not.
> > 
> > The in-cpu state does not matter.
> > 
> > fpu__activate_fpstate_write will disable lazy restore,
> > and ensure a full restore from memory.
> 
> Something has to fix up owner_ctx so that the previous thread doesn't
> think its CPU state is still valid.  I don't think
> activate_fpstate_write does it, because nothing should (I think) be
> calling it as a mattor of course on context switch.

These functions are called on tasks that are ptraced,
and currently stopped. fpu__activate_fpstate_write
sets fpu->last_cpu to -1, which will cause the next
context switch to the task to load the fpstate into
the registers.

The status in the struct fpu keeps track of the status
of things, though maybe not as explicitly as you would
want.

Having two separate status booleans for "registers valid"
and "memory valid" may make more sense.

Running the task in user space, would have to ensure
"registers valid" is true, and make "memory valid"
false, because userspace could write to the registers.

Doing a ptrace fpstate_read would make "memory valid"
true, but does not need to invalidate "registers valid".

Doing a ptrace fpstate_write would make "memory valid"
true, but would invalidate "registers valid".

Context switching away from a task would make "memory
valid" true, by virtue of copying the fpstate to
memory.

Ingo, would you have any objection to patches tracking
the validity of both register and memory states
independently?

We can get rid of fpu.counter, since nobody uses it
any more.

> > >  This is why I'm thinking that we should maybe revisit the
> > > API a bit to make this clearer and to avoid scattering around all
> > > the
> > > state manipulation quite so much.  I propose, as a straw-man:
> > > 
> > > user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
> > > current.
> > > After calling this, we're in state 1 or 2a.
> > > 
> > > user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
> > > writable
> > > for current.  Invalidates any in-memory copy.  After calling
> > > this,
> > > we're in state 1.
> > 
> > How can we distinguish between these two?
> > 
> > If the task is running, it can change its FPU
> > registers at any time, without any obvious thing
> > to mark the transition from state 2a to state 1.
> 
> That's true now, but I don't think it's really true any more with
> your
> patches.  If a task is running in *kernel* mode, then I think that
> pretty much all of the combinations are possible.  When the task
> actually enters user mode, we have to make sure we're in state 1 so
> the CPU regs belong to the task and are writable.

That is fair.

> > It sounds like the first function you name would
> > only be useful if we prevented a switch to user
> > space, which could immediately do whatever it wants
> > with the registers.
> > 
> > Is there a use case for user_fpregs_copy_to_cpu()?
> 
> With PKRU, any user memory access would need
> user_fpregs_copy_to_cpu().  I think we should get rid of this and
> just
> switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
> context switch to avoid this.  Other than that, I can't think of any
> use cases off the top of my head unless the signal code wanted to
> play
> 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel  wrote:
> On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel  wrote:
>> >
>> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > >
>> > > On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
>> > > >
>> > > >
>> > > >  #define CREATE_TRACE_POINTS
>> > > >  #include 
>> > > > @@ -197,6 +198,9 @@ __visible inline void
>> > > > prepare_exit_to_usermode(struct pt_regs *regs)
>> > > > if (unlikely(cached_flags &
>> > > > EXIT_TO_USERMODE_LOOP_FLAGS))
>> > > > exit_to_usermode_loop(regs, cached_flags);
>> > > >
>> > > > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > > > +   switch_fpu_return();
>> > > > +
>> > > How about:
>> > >
>> > > if (unlikely(...)) {
>> > >   exit_to_usermode_loop(regs, cached_flags);
>> > >   cached_flags = READ_ONCE(ti->flags);

Like this?

>> > > }
>> > >
>> > > if (ti->flags & _TIF_LOAD_FPU) {
>> > >   clear_thread_flag(TIF_LOAD_FPU);
>> > >   switch_fpu_return();
>> > > }
>> > It looks like test_thread_flag should be fine for avoiding
>> > atomics, too?
>> >
>> >   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>> >   clear_thread_flag(TIF_LOAD_FPU);
>> >   switch_fpu_return();
>> >   }
>> True, but we've got that cached_flags variable already...
>
> The cached flag may no longer be valid after
> exit_to_usermode_loop() returns, because that
> code is preemptible.
>
> We have to reload a fresh ti->flags after
> preemption is disabled and irqs are off.

See above...

>
>> One of my objections to the current code structure is that I find it
>> quite hard to keep track of all of the possible states of the fpu
>> that
>> we have.  Off the top of my head, we have:
>>
>> 1. In-cpu state is valid and in-memory state is invalid.  (This is
>> the
>> case when we're running user code, for example.)
>> 2. In-memory state is valid.
>> 2a. In-cpu state is *also* valid for some particular task.  In this
>> state, no one may write to either FPU state copy for that task lest
>> they get out of sync.
>> 2b. In-cpu state is not valid.
>
> It gets more fun with ptrace. Look at xfpregs_get
> and xfpregs_set, which call fpu__activate_fpstate_read
> and fpu__activate_fpstate_write respectively.
>
> It looks like those functions already deal with the
> cases you outline above.

Hmm, maybe they could be used.  The names are IMO atrocious.  I had to
read the docs and the code to figure out WTF "activating" the
"fpstate" even meant.

>
>> Rik, your patches further complicate this by making it possible to
>> have the in-cpu state be valid for a non-current task even when
>> non-idle.
>
> However, we never need to examine the in-cpu state for
> a task that is not currently running, or being scheduled
> to in the context switch code.
>
> We always save the FPU register state when a task is
> context switched out, so for a non-current task, we
> still only need to consider whether the in-memory state
> is valid or not.
>
> The in-cpu state does not matter.
>
> fpu__activate_fpstate_write will disable lazy restore,
> and ensure a full restore from memory.

Something has to fix up owner_ctx so that the previous thread doesn't
think its CPU state is still valid.  I don't think
activate_fpstate_write does it, because nothing should (I think) be
calling it as a mattor of course on context switch.

>
>>  This is why I'm thinking that we should maybe revisit the
>> API a bit to make this clearer and to avoid scattering around all the
>> state manipulation quite so much.  I propose, as a straw-man:
>>
>> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
>> current.
>> After calling this, we're in state 1 or 2a.
>>
>> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
>> writable
>> for current.  Invalidates any in-memory copy.  After calling this,
>> we're in state 1.
>
> How can we distinguish between these two?
>
> If the task is running, it can change its FPU
> registers at any time, without any obvious thing
> to mark the transition from state 2a to state 1.

That's true now, but I don't think it's really true any more with your
patches.  If a task is running in *kernel* mode, then I think that
pretty much all of the combinations are possible.  When the task
actually enters user mode, we have to make sure we're in state 1 so
the CPU regs belong to the task and are writable.

>
> It sounds like the first function you name would
> only be useful if we prevented a switch to user
> space, which could immediately do whatever it wants
> with the registers.
>
> Is there a use case for user_fpregs_copy_to_cpu()?

With PKRU, any user memory access would need
user_fpregs_copy_to_cpu().  I think we should get rid of this and just
switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
context switch to avoid this.  Other than that, I can't think of any
use cases off the top of my 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel  wrote:
> On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel  wrote:
>> >
>> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > >
>> > > On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
>> > > >
>> > > >
>> > > >  #define CREATE_TRACE_POINTS
>> > > >  #include 
>> > > > @@ -197,6 +198,9 @@ __visible inline void
>> > > > prepare_exit_to_usermode(struct pt_regs *regs)
>> > > > if (unlikely(cached_flags &
>> > > > EXIT_TO_USERMODE_LOOP_FLAGS))
>> > > > exit_to_usermode_loop(regs, cached_flags);
>> > > >
>> > > > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > > > +   switch_fpu_return();
>> > > > +
>> > > How about:
>> > >
>> > > if (unlikely(...)) {
>> > >   exit_to_usermode_loop(regs, cached_flags);
>> > >   cached_flags = READ_ONCE(ti->flags);

Like this?

>> > > }
>> > >
>> > > if (ti->flags & _TIF_LOAD_FPU) {
>> > >   clear_thread_flag(TIF_LOAD_FPU);
>> > >   switch_fpu_return();
>> > > }
>> > It looks like test_thread_flag should be fine for avoiding
>> > atomics, too?
>> >
>> >   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>> >   clear_thread_flag(TIF_LOAD_FPU);
>> >   switch_fpu_return();
>> >   }
>> True, but we've got that cached_flags variable already...
>
> The cached flag may no longer be valid after
> exit_to_usermode_loop() returns, because that
> code is preemptible.
>
> We have to reload a fresh ti->flags after
> preemption is disabled and irqs are off.

See above...

>
>> One of my objections to the current code structure is that I find it
>> quite hard to keep track of all of the possible states of the fpu
>> that
>> we have.  Off the top of my head, we have:
>>
>> 1. In-cpu state is valid and in-memory state is invalid.  (This is
>> the
>> case when we're running user code, for example.)
>> 2. In-memory state is valid.
>> 2a. In-cpu state is *also* valid for some particular task.  In this
>> state, no one may write to either FPU state copy for that task lest
>> they get out of sync.
>> 2b. In-cpu state is not valid.
>
> It gets more fun with ptrace. Look at xfpregs_get
> and xfpregs_set, which call fpu__activate_fpstate_read
> and fpu__activate_fpstate_write respectively.
>
> It looks like those functions already deal with the
> cases you outline above.

Hmm, maybe they could be used.  The names are IMO atrocious.  I had to
read the docs and the code to figure out WTF "activating" the
"fpstate" even meant.

>
>> Rik, your patches further complicate this by making it possible to
>> have the in-cpu state be valid for a non-current task even when
>> non-idle.
>
> However, we never need to examine the in-cpu state for
> a task that is not currently running, or being scheduled
> to in the context switch code.
>
> We always save the FPU register state when a task is
> context switched out, so for a non-current task, we
> still only need to consider whether the in-memory state
> is valid or not.
>
> The in-cpu state does not matter.
>
> fpu__activate_fpstate_write will disable lazy restore,
> and ensure a full restore from memory.

Something has to fix up owner_ctx so that the previous thread doesn't
think its CPU state is still valid.  I don't think
activate_fpstate_write does it, because nothing should (I think) be
calling it as a mattor of course on context switch.

>
>>  This is why I'm thinking that we should maybe revisit the
>> API a bit to make this clearer and to avoid scattering around all the
>> state manipulation quite so much.  I propose, as a straw-man:
>>
>> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
>> current.
>> After calling this, we're in state 1 or 2a.
>>
>> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
>> writable
>> for current.  Invalidates any in-memory copy.  After calling this,
>> we're in state 1.
>
> How can we distinguish between these two?
>
> If the task is running, it can change its FPU
> registers at any time, without any obvious thing
> to mark the transition from state 2a to state 1.

That's true now, but I don't think it's really true any more with your
patches.  If a task is running in *kernel* mode, then I think that
pretty much all of the combinations are possible.  When the task
actually enters user mode, we have to make sure we're in state 1 so
the CPU regs belong to the task and are writable.

>
> It sounds like the first function you name would
> only be useful if we prevented a switch to user
> space, which could immediately do whatever it wants
> with the registers.
>
> Is there a use case for user_fpregs_copy_to_cpu()?

With PKRU, any user memory access would need
user_fpregs_copy_to_cpu().  I think we should get rid of this and just
switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
context switch to avoid this.  Other than that, I can't think of any
use cases off the top of my head unless the signal code wanted to play
games 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel  wrote:
> > 
> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> > > 
> > > On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> > > > 
> > > > 
> > > >  #define CREATE_TRACE_POINTS
> > > >  #include 
> > > > @@ -197,6 +198,9 @@ __visible inline void
> > > > prepare_exit_to_usermode(struct pt_regs *regs)
> > > > if (unlikely(cached_flags &
> > > > EXIT_TO_USERMODE_LOOP_FLAGS))
> > > > exit_to_usermode_loop(regs, cached_flags);
> > > > 
> > > > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> > > > +   switch_fpu_return();
> > > > +
> > > How about:
> > > 
> > > if (unlikely(...)) {
> > >   exit_to_usermode_loop(regs, cached_flags);
> > >   cached_flags = READ_ONCE(ti->flags);
> > > }
> > > 
> > > if (ti->flags & _TIF_LOAD_FPU) {
> > >   clear_thread_flag(TIF_LOAD_FPU);
> > >   switch_fpu_return();
> > > }
> > It looks like test_thread_flag should be fine for avoiding
> > atomics, too?
> > 
> >   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
> >   clear_thread_flag(TIF_LOAD_FPU);
> >   switch_fpu_return();
> >   }
> True, but we've got that cached_flags variable already...

The cached flag may no longer be valid after
exit_to_usermode_loop() returns, because that
code is preemptible.

We have to reload a fresh ti->flags after 
preemption is disabled and irqs are off.

> One of my objections to the current code structure is that I find it
> quite hard to keep track of all of the possible states of the fpu
> that
> we have.  Off the top of my head, we have:
> 
> 1. In-cpu state is valid and in-memory state is invalid.  (This is
> the
> case when we're running user code, for example.)
> 2. In-memory state is valid.
> 2a. In-cpu state is *also* valid for some particular task.  In this
> state, no one may write to either FPU state copy for that task lest
> they get out of sync.
> 2b. In-cpu state is not valid.

It gets more fun with ptrace. Look at xfpregs_get
and xfpregs_set, which call fpu__activate_fpstate_read
and fpu__activate_fpstate_write respectively.

It looks like those functions already deal with the
cases you outline above.

> Rik, your patches further complicate this by making it possible to
> have the in-cpu state be valid for a non-current task even when
> non-idle. 

However, we never need to examine the in-cpu state for
a task that is not currently running, or being scheduled
to in the context switch code.

We always save the FPU register state when a task is
context switched out, so for a non-current task, we
still only need to consider whether the in-memory state
is valid or not.

The in-cpu state does not matter.

fpu__activate_fpstate_write will disable lazy restore,
and ensure a full restore from memory.

>  This is why I'm thinking that we should maybe revisit the
> API a bit to make this clearer and to avoid scattering around all the
> state manipulation quite so much.  I propose, as a straw-man:
> 
> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
> current.
> After calling this, we're in state 1 or 2a.
> 
> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
> writable
> for current.  Invalidates any in-memory copy.  After calling this,
> we're in state 1.

How can we distinguish between these two?

If the task is running, it can change its FPU
registers at any time, without any obvious thing
to mark the transition from state 2a to state 1.

It sounds like the first function you name would
only be useful if we prevented a switch to user
space, which could immediately do whatever it wants
with the registers.

Is there a use case for user_fpregs_copy_to_cpu()?

> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid
> for
> current.  After calling this, we're in state 2a or 2b.
> 
> user_fpregs_move_to_memory() -- Makes the in-memory state be valid
> and
> writable for current.  After calling this, we're in state 2b.

We can only ensure that the in memory copy stays
valid, by stopping the task from writing to the
in-CPU copy.

This sounds a lot like fpu__activate_fpstate_read

What we need after copying the contents to memory is
a way to invalidate the in-CPU copy, if changes were
made to the in-memory copy.

This is already done by fpu__activate_fpstate_write

> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
> and PKRU memory state readers.  MPX and PKRU will do
> user_fpregs_move_to_memory() before writing to memory.
> kernel_fpu_begin() does user_fpregs_move_to_memory().

This is done by switch_fpu_prepare.

Can you think of any other place in the kernel where we
need a 1 -> 2a transition, besides context switching and
ptrace/xfpregs_get?

> There are a couple ways we could deal with TIF_LOAD_FPU in this
> scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
> unconditionally on schedule-in *and* on 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Rik van Riel
On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel  wrote:
> > 
> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> > > 
> > > On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> > > > 
> > > > 
> > > >  #define CREATE_TRACE_POINTS
> > > >  #include 
> > > > @@ -197,6 +198,9 @@ __visible inline void
> > > > prepare_exit_to_usermode(struct pt_regs *regs)
> > > > if (unlikely(cached_flags &
> > > > EXIT_TO_USERMODE_LOOP_FLAGS))
> > > > exit_to_usermode_loop(regs, cached_flags);
> > > > 
> > > > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> > > > +   switch_fpu_return();
> > > > +
> > > How about:
> > > 
> > > if (unlikely(...)) {
> > >   exit_to_usermode_loop(regs, cached_flags);
> > >   cached_flags = READ_ONCE(ti->flags);
> > > }
> > > 
> > > if (ti->flags & _TIF_LOAD_FPU) {
> > >   clear_thread_flag(TIF_LOAD_FPU);
> > >   switch_fpu_return();
> > > }
> > It looks like test_thread_flag should be fine for avoiding
> > atomics, too?
> > 
> >   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
> >   clear_thread_flag(TIF_LOAD_FPU);
> >   switch_fpu_return();
> >   }
> True, but we've got that cached_flags variable already...

The cached flag may no longer be valid after
exit_to_usermode_loop() returns, because that
code is preemptible.

We have to reload a fresh ti->flags after 
preemption is disabled and irqs are off.

> One of my objections to the current code structure is that I find it
> quite hard to keep track of all of the possible states of the fpu
> that
> we have.  Off the top of my head, we have:
> 
> 1. In-cpu state is valid and in-memory state is invalid.  (This is
> the
> case when we're running user code, for example.)
> 2. In-memory state is valid.
> 2a. In-cpu state is *also* valid for some particular task.  In this
> state, no one may write to either FPU state copy for that task lest
> they get out of sync.
> 2b. In-cpu state is not valid.

It gets more fun with ptrace. Look at xfpregs_get
and xfpregs_set, which call fpu__activate_fpstate_read
and fpu__activate_fpstate_write respectively.

It looks like those functions already deal with the
cases you outline above.

> Rik, your patches further complicate this by making it possible to
> have the in-cpu state be valid for a non-current task even when
> non-idle. 

However, we never need to examine the in-cpu state for
a task that is not currently running, or being scheduled
to in the context switch code.

We always save the FPU register state when a task is
context switched out, so for a non-current task, we
still only need to consider whether the in-memory state
is valid or not.

The in-cpu state does not matter.

fpu__activate_fpstate_write will disable lazy restore,
and ensure a full restore from memory.

>  This is why I'm thinking that we should maybe revisit the
> API a bit to make this clearer and to avoid scattering around all the
> state manipulation quite so much.  I propose, as a straw-man:
> 
> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
> current.
> After calling this, we're in state 1 or 2a.
> 
> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
> writable
> for current.  Invalidates any in-memory copy.  After calling this,
> we're in state 1.

How can we distinguish between these two?

If the task is running, it can change its FPU
registers at any time, without any obvious thing
to mark the transition from state 2a to state 1.

It sounds like the first function you name would
only be useful if we prevented a switch to user
space, which could immediately do whatever it wants
with the registers.

Is there a use case for user_fpregs_copy_to_cpu()?

> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid
> for
> current.  After calling this, we're in state 2a or 2b.
> 
> user_fpregs_move_to_memory() -- Makes the in-memory state be valid
> and
> writable for current.  After calling this, we're in state 2b.

We can only ensure that the in memory copy stays
valid, by stopping the task from writing to the
in-CPU copy.

This sounds a lot like fpu__activate_fpstate_read

What we need after copying the contents to memory is
a way to invalidate the in-CPU copy, if changes were
made to the in-memory copy.

This is already done by fpu__activate_fpstate_write

> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
> and PKRU memory state readers.  MPX and PKRU will do
> user_fpregs_move_to_memory() before writing to memory.
> kernel_fpu_begin() does user_fpregs_move_to_memory().

This is done by switch_fpu_prepare.

Can you think of any other place in the kernel where we
need a 1 -> 2a transition, besides context switching and
ptrace/xfpregs_get?

> There are a couple ways we could deal with TIF_LOAD_FPU in this
> scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
> unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
> and make the 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel  wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
>> >
>> >  #define CREATE_TRACE_POINTS
>> >  #include 
>> > @@ -197,6 +198,9 @@ __visible inline void
>> > prepare_exit_to_usermode(struct pt_regs *regs)
>> > if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>> > exit_to_usermode_loop(regs, cached_flags);
>> >
>> > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > +   switch_fpu_return();
>> > +
>>
>> How about:
>>
>> if (unlikely(...)) {
>>   exit_to_usermode_loop(regs, cached_flags);
>>   cached_flags = READ_ONCE(ti->flags);
>> }
>>
>> if (ti->flags & _TIF_LOAD_FPU) {
>>   clear_thread_flag(TIF_LOAD_FPU);
>>   switch_fpu_return();
>> }
>
> It looks like test_thread_flag should be fine for avoiding
> atomics, too?
>
>   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>   clear_thread_flag(TIF_LOAD_FPU);
>   switch_fpu_return();
>   }

True, but we've got that cached_flags variable already...

>
>
>> > +static inline void switch_fpu_finish(void)
>> >  {
>> > +   set_thread_flag(TIF_LOAD_FPU);
>> >  }
>>
>> I can imagine this causing problems with kernel code that accesses
>> current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it
>> would
>> make sense to make your changes deeper into the FPU core innards so
>> that, for example, we'd have explicit functions that cause the
>
> Whereabouts do you have in mind?
>
> I like your general idea, but am not sure quite where
> to put it.

Somewhere in arch/x86/kernel/fpu/ perhaps?

One of my objections to the current code structure is that I find it
quite hard to keep track of all of the possible states of the fpu that
we have.  Off the top of my head, we have:

1. In-cpu state is valid and in-memory state is invalid.  (This is the
case when we're running user code, for example.)
2. In-memory state is valid.
2a. In-cpu state is *also* valid for some particular task.  In this
state, no one may write to either FPU state copy for that task lest
they get out of sync.
2b. In-cpu state is not valid.

Rik, your patches further complicate this by making it possible to
have the in-cpu state be valid for a non-current task even when
non-idle.  This is why I'm thinking that we should maybe revisit the
API a bit to make this clearer and to avoid scattering around all the
state manipulation quite so much.  I propose, as a straw-man:

user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for current.
After calling this, we're in state 1 or 2a.

user_fpregs_move_to_cpu() -- Makes the CPU state be valid and writable
for current.  Invalidates any in-memory copy.  After calling this,
we're in state 1.

user_fpregs_copy_to_memory() -- Makes the in-memory state be valid for
current.  After calling this, we're in state 2a or 2b.

user_fpregs_move_to_memory() -- Makes the in-memory state be valid and
writable for current.  After calling this, we're in state 2b.

The xxx_to_cpu() functions will make sure that, if the cpu state was
valid for a different task, it stops being valid for that different
task.

Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
and PKRU memory state readers.  MPX and PKRU will do
user_fpregs_move_to_memory() before writing to memory.
kernel_fpu_begin() does user_fpregs_move_to_memory().

There are a couple ways we could deal with TIF_LOAD_FPU in this
scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
and make the handler do user_fpregs_move_to_cpu().  This avoids
needing to deal with complications in which we're in state 2a and we
land in actual user code and then forget that the in-memory copy is
out of date.   It also means we get TIF_REFRESH_FPU for free for
newly-forked threads.

>> in-memory state for current to be up-to-date and readable, to cause
>> the in-memory state to be up-to-date and writable (which is the same
>> thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
>> in-CPU state to be up-to-date (possibly readable and writable).
>> TIF_LOAD_FPU would trigger the latter.
>>
>> I've often found it confusing that fpu__save by itself has somewhat
>> ill-defined effects.
>
> Fully agreed on both. I guess that means we want a few
> different functions:
>
> 1) initialize FPU state, both in memory and in registers
>
> 2) cause FPU state in registers to be updated from memory,
>if necessary
>
> 3) cause FPU state in memory to be updated from registers,
>if necessary
>
> The latter functions could use fpu->fpregs_active to see
> whether any action is required, and be called from various
> places in the code.
>
> The signal code in particular is doing some strange things
> that should probably be hidden away in FPU specific functions.
>
>>
>> >
>> > +/*
>> > + * Set up the userspace FPU context before 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Andy Lutomirski
On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel  wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
>> >
>> >  #define CREATE_TRACE_POINTS
>> >  #include 
>> > @@ -197,6 +198,9 @@ __visible inline void
>> > prepare_exit_to_usermode(struct pt_regs *regs)
>> > if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>> > exit_to_usermode_loop(regs, cached_flags);
>> >
>> > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > +   switch_fpu_return();
>> > +
>>
>> How about:
>>
>> if (unlikely(...)) {
>>   exit_to_usermode_loop(regs, cached_flags);
>>   cached_flags = READ_ONCE(ti->flags);
>> }
>>
>> if (ti->flags & _TIF_LOAD_FPU) {
>>   clear_thread_flag(TIF_LOAD_FPU);
>>   switch_fpu_return();
>> }
>
> It looks like test_thread_flag should be fine for avoiding
> atomics, too?
>
>   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>   clear_thread_flag(TIF_LOAD_FPU);
>   switch_fpu_return();
>   }

True, but we've got that cached_flags variable already...

>
>
>> > +static inline void switch_fpu_finish(void)
>> >  {
>> > +   set_thread_flag(TIF_LOAD_FPU);
>> >  }
>>
>> I can imagine this causing problems with kernel code that accesses
>> current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it
>> would
>> make sense to make your changes deeper into the FPU core innards so
>> that, for example, we'd have explicit functions that cause the
>
> Whereabouts do you have in mind?
>
> I like your general idea, but am not sure quite where
> to put it.

Somewhere in arch/x86/kernel/fpu/ perhaps?

One of my objections to the current code structure is that I find it
quite hard to keep track of all of the possible states of the fpu that
we have.  Off the top of my head, we have:

1. In-cpu state is valid and in-memory state is invalid.  (This is the
case when we're running user code, for example.)
2. In-memory state is valid.
2a. In-cpu state is *also* valid for some particular task.  In this
state, no one may write to either FPU state copy for that task lest
they get out of sync.
2b. In-cpu state is not valid.

Rik, your patches further complicate this by making it possible to
have the in-cpu state be valid for a non-current task even when
non-idle.  This is why I'm thinking that we should maybe revisit the
API a bit to make this clearer and to avoid scattering around all the
state manipulation quite so much.  I propose, as a straw-man:

user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for current.
After calling this, we're in state 1 or 2a.

user_fpregs_move_to_cpu() -- Makes the CPU state be valid and writable
for current.  Invalidates any in-memory copy.  After calling this,
we're in state 1.

user_fpregs_copy_to_memory() -- Makes the in-memory state be valid for
current.  After calling this, we're in state 2a or 2b.

user_fpregs_move_to_memory() -- Makes the in-memory state be valid and
writable for current.  After calling this, we're in state 2b.

The xxx_to_cpu() functions will make sure that, if the cpu state was
valid for a different task, it stops being valid for that different
task.

Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
and PKRU memory state readers.  MPX and PKRU will do
user_fpregs_move_to_memory() before writing to memory.
kernel_fpu_begin() does user_fpregs_move_to_memory().

There are a couple ways we could deal with TIF_LOAD_FPU in this
scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
and make the handler do user_fpregs_move_to_cpu().  This avoids
needing to deal with complications in which we're in state 2a and we
land in actual user code and then forget that the in-memory copy is
out of date.   It also means we get TIF_REFRESH_FPU for free for
newly-forked threads.

>> in-memory state for current to be up-to-date and readable, to cause
>> the in-memory state to be up-to-date and writable (which is the same
>> thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
>> in-CPU state to be up-to-date (possibly readable and writable).
>> TIF_LOAD_FPU would trigger the latter.
>>
>> I've often found it confusing that fpu__save by itself has somewhat
>> ill-defined effects.
>
> Fully agreed on both. I guess that means we want a few
> different functions:
>
> 1) initialize FPU state, both in memory and in registers
>
> 2) cause FPU state in registers to be updated from memory,
>if necessary
>
> 3) cause FPU state in memory to be updated from registers,
>if necessary
>
> The latter functions could use fpu->fpregs_active to see
> whether any action is required, and be called from various
> places in the code.
>
> The signal code in particular is doing some strange things
> that should probably be hidden away in FPU specific functions.
>
>>
>> >
>> > +/*
>> > + * Set up the userspace FPU context before returning to userspace.
>> > + */
>> > 

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Dave Hansen
On 10/01/2016 05:42 PM, Rik van Riel wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
>>> > > +static inline void switch_fpu_finish(void)
>>> > >  {
>>> > > +   set_thread_flag(TIF_LOAD_FPU);
>>> > >  }
>> > 
>> > I can imagine this causing problems with kernel code that accesses
>> > current's FPU state, e.g. get_xsave_field_ptr(). 
> That makes me wonder, what test programs do people have
> to verify the correctness of the FPU switching code?

The MPX testing code in selftests/ goes off the rails pretty quickly
when the FP/XSAVE state gets corrupt.  It has found quite a few bugs in
the last few years.  The protection keys code in there also keeps a
shadow copy of the "PKRU" register in software which also makes it
notice pretty quickly if something goes awry.

That said, I'd _love_ to see more formal FPU testing done.  We've had
lots of bugs in there, and they tend to hit the newer features and newer
CPUs.  Let me know if you find something. :)


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-03 Thread Dave Hansen
On 10/01/2016 05:42 PM, Rik van Riel wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
>>> > > +static inline void switch_fpu_finish(void)
>>> > >  {
>>> > > +   set_thread_flag(TIF_LOAD_FPU);
>>> > >  }
>> > 
>> > I can imagine this causing problems with kernel code that accesses
>> > current's FPU state, e.g. get_xsave_field_ptr(). 
> That makes me wonder, what test programs do people have
> to verify the correctness of the FPU switching code?

The MPX testing code in selftests/ goes off the rails pretty quickly
when the FP/XSAVE state gets corrupt.  It has found quite a few bugs in
the last few years.  The protection keys code in there also keeps a
shadow copy of the "PKRU" register in software which also makes it
notice pretty quickly if something goes awry.

That said, I'd _love_ to see more formal FPU testing done.  We've had
lots of bugs in there, and they tend to hit the newer features and newer
CPUs.  Let me know if you find something. :)


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread Rik van Riel
On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> > 
> > 
> > +static inline void switch_fpu_finish(void)
> >  {
> > +   set_thread_flag(TIF_LOAD_FPU);
> >  }
> 
> I can imagine this causing problems with kernel code that accesses
> current's FPU state, e.g. get_xsave_field_ptr(). 

That makes me wonder, what test programs do people have
to verify the correctness of the FPU switching code?

I have a few floating point benchmarks that check
the results for correctness, but nothing that adds in
signals, for example.

What do people use?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread Rik van Riel
On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> > 
> > 
> > +static inline void switch_fpu_finish(void)
> >  {
> > +   set_thread_flag(TIF_LOAD_FPU);
> >  }
> 
> I can imagine this causing problems with kernel code that accesses
> current's FPU state, e.g. get_xsave_field_ptr(). 

That makes me wonder, what test programs do people have
to verify the correctness of the FPU switching code?

I have a few floating point benchmarks that check
the results for correctness, but nothing that adds in
signals, for example.

What do people use?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread Rik van Riel
On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> > 
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > @@ -197,6 +198,9 @@ __visible inline void
> > prepare_exit_to_usermode(struct pt_regs *regs)
> > if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> > exit_to_usermode_loop(regs, cached_flags);
> > 
> > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> > +   switch_fpu_return();
> > +
> 
> How about:
> 
> if (unlikely(...)) {
>   exit_to_usermode_loop(regs, cached_flags);
>   cached_flags = READ_ONCE(ti->flags);
> }
> 
> if (ti->flags & _TIF_LOAD_FPU) {
>   clear_thread_flag(TIF_LOAD_FPU);
>   switch_fpu_return();
> }

It looks like test_thread_flag should be fine for avoiding
atomics, too?

  if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
      clear_thread_flag(TIF_LOAD_FPU);
      switch_fpu_return();
  }


> > +static inline void switch_fpu_finish(void)
> >  {
> > +   set_thread_flag(TIF_LOAD_FPU);
> >  }
> 
> I can imagine this causing problems with kernel code that accesses
> current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it
> would
> make sense to make your changes deeper into the FPU core innards so
> that, for example, we'd have explicit functions that cause the

Whereabouts do you have in mind?

I like your general idea, but am not sure quite where
to put it.

> in-memory state for current to be up-to-date and readable, to cause
> the in-memory state to be up-to-date and writable (which is the same
> thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
> in-CPU state to be up-to-date (possibly readable and writable).
> TIF_LOAD_FPU would trigger the latter.
> 
> I've often found it confusing that fpu__save by itself has somewhat
> ill-defined effects.

Fully agreed on both. I guess that means we want a few
different functions:

1) initialize FPU state, both in memory and in registers

2) cause FPU state in registers to be updated from memory,
   if necessary

3) cause FPU state in memory to be updated from registers,
   if necessary

The latter functions could use fpu->fpregs_active to see
whether any action is required, and be called from various
places in the code.

The signal code in particular is doing some strange things
that should probably be hidden away in FPU specific functions.

> 
> > 
> > +/*
> > + * Set up the userspace FPU context before returning to userspace.
> > + */
> > +void switch_fpu_return(void)
> > +{
> > +   struct fpu *fpu = >thread.fpu;
> > +   bool preload;
> > +   /*
> > +* If the task has used the math, pre-load the FPU on xsave
> > processors
> > +* or if the past 5 consecutive context-switches used math.
> > +*/
> > +   preload = static_cpu_has(X86_FEATURE_FPU) &&
> > + fpu->fpstate_active &&
> > + (use_eager_fpu() || fpu->counter > 5);
> > +
> > +   if (preload) {
> > +   prefetch(>state);
> > +   fpu->counter++;
> > +   __fpregs_activate(fpu);
> > +   trace_x86_fpu_regs_activated(fpu);
> > +
> > +   /* Don't change CR0.TS if we just switch! */
> > +   if (!__this_cpu_read(fpu_active)) {
> > +   __fpregs_activate_hw();
> > +   __this_cpu_write(fpu_active, true);
> > +   }
> 
> We should just finish getting rid of all TS uses.

Agreed. I will rebase on top of your FPU changes, that
will make things easier for everybody.

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread Rik van Riel
On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> > 
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > @@ -197,6 +198,9 @@ __visible inline void
> > prepare_exit_to_usermode(struct pt_regs *regs)
> > if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> > exit_to_usermode_loop(regs, cached_flags);
> > 
> > +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> > +   switch_fpu_return();
> > +
> 
> How about:
> 
> if (unlikely(...)) {
>   exit_to_usermode_loop(regs, cached_flags);
>   cached_flags = READ_ONCE(ti->flags);
> }
> 
> if (ti->flags & _TIF_LOAD_FPU) {
>   clear_thread_flag(TIF_LOAD_FPU);
>   switch_fpu_return();
> }

It looks like test_thread_flag should be fine for avoiding
atomics, too?

  if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
      clear_thread_flag(TIF_LOAD_FPU);
      switch_fpu_return();
  }


> > +static inline void switch_fpu_finish(void)
> >  {
> > +   set_thread_flag(TIF_LOAD_FPU);
> >  }
> 
> I can imagine this causing problems with kernel code that accesses
> current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it
> would
> make sense to make your changes deeper into the FPU core innards so
> that, for example, we'd have explicit functions that cause the

Whereabouts do you have in mind?

I like your general idea, but am not sure quite where
to put it.

> in-memory state for current to be up-to-date and readable, to cause
> the in-memory state to be up-to-date and writable (which is the same
> thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
> in-CPU state to be up-to-date (possibly readable and writable).
> TIF_LOAD_FPU would trigger the latter.
> 
> I've often found it confusing that fpu__save by itself has somewhat
> ill-defined effects.

Fully agreed on both. I guess that means we want a few
different functions:

1) initialize FPU state, both in memory and in registers

2) cause FPU state in registers to be updated from memory,
   if necessary

3) cause FPU state in memory to be updated from registers,
   if necessary

The latter functions could use fpu->fpregs_active to see
whether any action is required, and be called from various
places in the code.

The signal code in particular is doing some strange things
that should probably be hidden away in FPU specific functions.

> 
> > 
> > +/*
> > + * Set up the userspace FPU context before returning to userspace.
> > + */
> > +void switch_fpu_return(void)
> > +{
> > +   struct fpu *fpu = >thread.fpu;
> > +   bool preload;
> > +   /*
> > +* If the task has used the math, pre-load the FPU on xsave
> > processors
> > +* or if the past 5 consecutive context-switches used math.
> > +*/
> > +   preload = static_cpu_has(X86_FEATURE_FPU) &&
> > + fpu->fpstate_active &&
> > + (use_eager_fpu() || fpu->counter > 5);
> > +
> > +   if (preload) {
> > +   prefetch(>state);
> > +   fpu->counter++;
> > +   __fpregs_activate(fpu);
> > +   trace_x86_fpu_regs_activated(fpu);
> > +
> > +   /* Don't change CR0.TS if we just switch! */
> > +   if (!__this_cpu_read(fpu_active)) {
> > +   __fpregs_activate_hw();
> > +   __this_cpu_write(fpu_active, true);
> > +   }
> 
> We should just finish getting rid of all TS uses.

Agreed. I will rebase on top of your FPU changes, that
will make things easier for everybody.

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread Andy Lutomirski
On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> From: Rik van Riel 
>
> Delay the loading of FPU registers until a process switches back to
> userspace. This allows us to skip FPU saving & restoring for kernel
> threads, the idle task, and tasks that are spinning in kernel space.
>
> It also allows us to not repeatedly save & restore the userspace FPU
> context on repeated invocations of kernel_fpu_start & kernel_fpu_end.
>
> Not overwriting the FPU state of a task unless we need to also allows
> us to be be lazier about restoring it, in a future patch.
>
> Signed-off-by: Rik van Riel 
> ---
>  arch/x86/entry/common.c |  4 
>  arch/x86/include/asm/fpu/api.h  |  5 +
>  arch/x86/include/asm/fpu/internal.h | 44 
> +
>  arch/x86/include/asm/thread_info.h  |  4 +++-
>  arch/x86/kernel/fpu/core.c  | 17 --
>  arch/x86/kernel/process.c   | 35 +
>  arch/x86/kernel/process_32.c|  5 ++---
>  arch/x86/kernel/process_64.c|  5 ++---
>  8 files changed, 71 insertions(+), 48 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 1433f6b4607d..a69bbefa3408 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct 
> pt_regs *regs)
> if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> exit_to_usermode_loop(regs, cached_flags);
>
> +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> +   switch_fpu_return();
> +

How about:

if (unlikely(...)) {
  exit_to_usermode_loop(regs, cached_flags);
  cached_flags = READ_ONCE(ti->flags);
}

if (ti->flags & _TIF_LOAD_FPU) {
  clear_thread_flag(TIF_LOAD_FPU);
  switch_fpu_return();
}

or something along those lines.  The issue is that
test_and_clear_thread_flag is unconditionally atomic, which means that
it will slow down the common fast case by quite a bit.

Alternatively, you could try to jam it into thread_struct::status,
which would let you avoid an atomic operation when you clear it, but
you'd still need an atomic op to set it, so this might not be a win.


> +static inline void switch_fpu_finish(void)
>  {
> -   bool preload;
> -   /*
> -* If the task has used the math, pre-load the FPU on xsave processors
> -* or if the past 5 consecutive context-switches used math.
> -*/
> -   preload = static_cpu_has(X86_FEATURE_FPU) &&
> - new_fpu->fpstate_active &&
> - (use_eager_fpu() || new_fpu->counter > 5);
> -
> -   if (preload) {
> -   prefetch(_fpu->state);
> -   new_fpu->counter++;
> -   __fpregs_activate(new_fpu);
> -   trace_x86_fpu_regs_activated(new_fpu);
> -
> -   /* Don't change CR0.TS if we just switch! */
> -   if (!__this_cpu_read(fpu_active)) {
> -   __fpregs_activate_hw();
> -   __this_cpu_write(fpu_active, true);
> -   }
> -
> -   copy_kernel_to_fpregs(_fpu->state);
> -   } else if (__this_cpu_read(fpu_active)) {
> -   __this_cpu_write(fpu_active, false);
> -   __fpregs_deactivate_hw();
> -   }
> +   set_thread_flag(TIF_LOAD_FPU);
>  }

I can imagine this causing problems with kernel code that accesses
current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it would
make sense to make your changes deeper into the FPU core innards so
that, for example, we'd have explicit functions that cause the
in-memory state for current to be up-to-date and readable, to cause
the in-memory state to be up-to-date and writable (which is the same
thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
in-CPU state to be up-to-date (possibly readable and writable).
TIF_LOAD_FPU would trigger the latter.

I've often found it confusing that fpu__save by itself has somewhat
ill-defined effects.


> +/*
> + * Set up the userspace FPU context before returning to userspace.
> + */
> +void switch_fpu_return(void)
> +{
> +   struct fpu *fpu = >thread.fpu;
> +   bool preload;
> +   /*
> +* If the task has used the math, pre-load the FPU on xsave processors
> +* or if the past 5 consecutive context-switches used math.
> +*/
> +   preload = static_cpu_has(X86_FEATURE_FPU) &&
> + fpu->fpstate_active &&
> + (use_eager_fpu() || fpu->counter > 5);
> +
> +   if (preload) {
> +   prefetch(>state);
> +   fpu->counter++;
> +   __fpregs_activate(fpu);
> +   trace_x86_fpu_regs_activated(fpu);
> +
> +   /* Don't change CR0.TS if we just switch! */
> +  

Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread Andy Lutomirski
On Sat, Oct 1, 2016 at 1:31 PM,   wrote:
> From: Rik van Riel 
>
> Delay the loading of FPU registers until a process switches back to
> userspace. This allows us to skip FPU saving & restoring for kernel
> threads, the idle task, and tasks that are spinning in kernel space.
>
> It also allows us to not repeatedly save & restore the userspace FPU
> context on repeated invocations of kernel_fpu_start & kernel_fpu_end.
>
> Not overwriting the FPU state of a task unless we need to also allows
> us to be be lazier about restoring it, in a future patch.
>
> Signed-off-by: Rik van Riel 
> ---
>  arch/x86/entry/common.c |  4 
>  arch/x86/include/asm/fpu/api.h  |  5 +
>  arch/x86/include/asm/fpu/internal.h | 44 
> +
>  arch/x86/include/asm/thread_info.h  |  4 +++-
>  arch/x86/kernel/fpu/core.c  | 17 --
>  arch/x86/kernel/process.c   | 35 +
>  arch/x86/kernel/process_32.c|  5 ++---
>  arch/x86/kernel/process_64.c|  5 ++---
>  8 files changed, 71 insertions(+), 48 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 1433f6b4607d..a69bbefa3408 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct 
> pt_regs *regs)
> if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> exit_to_usermode_loop(regs, cached_flags);
>
> +   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> +   switch_fpu_return();
> +

How about:

if (unlikely(...)) {
  exit_to_usermode_loop(regs, cached_flags);
  cached_flags = READ_ONCE(ti->flags);
}

if (ti->flags & _TIF_LOAD_FPU) {
  clear_thread_flag(TIF_LOAD_FPU);
  switch_fpu_return();
}

or something along those lines.  The issue is that
test_and_clear_thread_flag is unconditionally atomic, which means that
it will slow down the common fast case by quite a bit.

Alternatively, you could try to jam it into thread_struct::status,
which would let you avoid an atomic operation when you clear it, but
you'd still need an atomic op to set it, so this might not be a win.


> +static inline void switch_fpu_finish(void)
>  {
> -   bool preload;
> -   /*
> -* If the task has used the math, pre-load the FPU on xsave processors
> -* or if the past 5 consecutive context-switches used math.
> -*/
> -   preload = static_cpu_has(X86_FEATURE_FPU) &&
> - new_fpu->fpstate_active &&
> - (use_eager_fpu() || new_fpu->counter > 5);
> -
> -   if (preload) {
> -   prefetch(_fpu->state);
> -   new_fpu->counter++;
> -   __fpregs_activate(new_fpu);
> -   trace_x86_fpu_regs_activated(new_fpu);
> -
> -   /* Don't change CR0.TS if we just switch! */
> -   if (!__this_cpu_read(fpu_active)) {
> -   __fpregs_activate_hw();
> -   __this_cpu_write(fpu_active, true);
> -   }
> -
> -   copy_kernel_to_fpregs(_fpu->state);
> -   } else if (__this_cpu_read(fpu_active)) {
> -   __this_cpu_write(fpu_active, false);
> -   __fpregs_deactivate_hw();
> -   }
> +   set_thread_flag(TIF_LOAD_FPU);
>  }

I can imagine this causing problems with kernel code that accesses
current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it would
make sense to make your changes deeper into the FPU core innards so
that, for example, we'd have explicit functions that cause the
in-memory state for current to be up-to-date and readable, to cause
the in-memory state to be up-to-date and writable (which is the same
thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
in-CPU state to be up-to-date (possibly readable and writable).
TIF_LOAD_FPU would trigger the latter.

I've often found it confusing that fpu__save by itself has somewhat
ill-defined effects.


> +/*
> + * Set up the userspace FPU context before returning to userspace.
> + */
> +void switch_fpu_return(void)
> +{
> +   struct fpu *fpu = >thread.fpu;
> +   bool preload;
> +   /*
> +* If the task has used the math, pre-load the FPU on xsave processors
> +* or if the past 5 consecutive context-switches used math.
> +*/
> +   preload = static_cpu_has(X86_FEATURE_FPU) &&
> + fpu->fpstate_active &&
> + (use_eager_fpu() || fpu->counter > 5);
> +
> +   if (preload) {
> +   prefetch(>state);
> +   fpu->counter++;
> +   __fpregs_activate(fpu);
> +   trace_x86_fpu_regs_activated(fpu);
> +
> +   /* Don't change CR0.TS if we just switch! */
> +   if (!__this_cpu_read(fpu_active)) {
> +   

[PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread riel
From: Rik van Riel 

Delay the loading of FPU registers until a process switches back to
userspace. This allows us to skip FPU saving & restoring for kernel
threads, the idle task, and tasks that are spinning in kernel space.

It also allows us to not repeatedly save & restore the userspace FPU
context on repeated invocations of kernel_fpu_start & kernel_fpu_end.

Not overwriting the FPU state of a task unless we need to also allows
us to be be lazier about restoring it, in a future patch.

Signed-off-by: Rik van Riel 
---
 arch/x86/entry/common.c |  4 
 arch/x86/include/asm/fpu/api.h  |  5 +
 arch/x86/include/asm/fpu/internal.h | 44 +
 arch/x86/include/asm/thread_info.h  |  4 +++-
 arch/x86/kernel/fpu/core.c  | 17 --
 arch/x86/kernel/process.c   | 35 +
 arch/x86/kernel/process_32.c|  5 ++---
 arch/x86/kernel/process_64.c|  5 ++---
 8 files changed, 71 insertions(+), 48 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1433f6b4607d..a69bbefa3408 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct 
pt_regs *regs)
if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
exit_to_usermode_loop(regs, cached_flags);
 
+   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
+   switch_fpu_return();
+
 #ifdef CONFIG_COMPAT
/*
 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 1429a7c736db..edd7dc7ae4f7 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -37,6 +37,11 @@ extern int  irq_ts_save(void);
 extern void irq_ts_restore(int TS_state);
 
 /*
+ * Set up the userspace FPU context before returning to userspace.
+ */
+extern void switch_fpu_return(void);
+
+/*
  * Query the presence of one or more xfeatures. Works on any legacy CPU as 
well.
  *
  * If 'feature_name' is set then put a human-readable description of
diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 79e1cee9f3b0..b5accb35e434 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * High level FPU state handling functions:
@@ -576,13 +577,17 @@ static inline void fpregs_deactivate(struct fpu *fpu)
 /*
  * FPU state switching for scheduling.
  *
- * This is a two-stage process:
+ * This is a three-stage process:
  *
  *  - switch_fpu_prepare() saves the old state.
  *This is done within the context of the old process.
  *
- *  - switch_fpu_finish() restores the new state
- *and flips CR0.TS as necessary.
+ *  - switch_fpu_finish() sets TIF_LOAD_CPU, causing FPU state to
+ *be loaded when the new process returns to userspace.
+ *This is done with current_task pointing to the new process.
+ *
+ *  - switch_fpu_return() restores the new state and flips CR0.TS as
+ *necessary. This only runs if the process returns to userspace.
  */
 static inline void
 switch_fpu_prepare(struct fpu *old_fpu, int cpu)
@@ -605,38 +610,9 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu)
 /*
  * Misc helper functions:
  */
-
-/*
- * Set up the userspace FPU context for the new task.
- */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(void)
 {
-   bool preload;
-   /*
-* If the task has used the math, pre-load the FPU on xsave processors
-* or if the past 5 consecutive context-switches used math.
-*/
-   preload = static_cpu_has(X86_FEATURE_FPU) &&
- new_fpu->fpstate_active &&
- (use_eager_fpu() || new_fpu->counter > 5);
-
-   if (preload) {
-   prefetch(_fpu->state);
-   new_fpu->counter++;
-   __fpregs_activate(new_fpu);
-   trace_x86_fpu_regs_activated(new_fpu);
-
-   /* Don't change CR0.TS if we just switch! */
-   if (!__this_cpu_read(fpu_active)) {
-   __fpregs_activate_hw();
-   __this_cpu_write(fpu_active, true);
-   }
-
-   copy_kernel_to_fpregs(_fpu->state);
-   } else if (__this_cpu_read(fpu_active)) {
-   __this_cpu_write(fpu_active, false);
-   __fpregs_deactivate_hw();
-   }
+   set_thread_flag(TIF_LOAD_FPU);
 }
 
 /*
diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 8b7c8d8e0852..401e9c3e6039 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -106,6 

[PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

2016-10-01 Thread riel
From: Rik van Riel 

Delay the loading of FPU registers until a process switches back to
userspace. This allows us to skip FPU saving & restoring for kernel
threads, the idle task, and tasks that are spinning in kernel space.

It also allows us to not repeatedly save & restore the userspace FPU
context on repeated invocations of kernel_fpu_start & kernel_fpu_end.

Not overwriting the FPU state of a task unless we need to also allows
us to be be lazier about restoring it, in a future patch.

Signed-off-by: Rik van Riel 
---
 arch/x86/entry/common.c |  4 
 arch/x86/include/asm/fpu/api.h  |  5 +
 arch/x86/include/asm/fpu/internal.h | 44 +
 arch/x86/include/asm/thread_info.h  |  4 +++-
 arch/x86/kernel/fpu/core.c  | 17 --
 arch/x86/kernel/process.c   | 35 +
 arch/x86/kernel/process_32.c|  5 ++---
 arch/x86/kernel/process_64.c|  5 ++---
 8 files changed, 71 insertions(+), 48 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1433f6b4607d..a69bbefa3408 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct 
pt_regs *regs)
if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
exit_to_usermode_loop(regs, cached_flags);
 
+   if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
+   switch_fpu_return();
+
 #ifdef CONFIG_COMPAT
/*
 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 1429a7c736db..edd7dc7ae4f7 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -37,6 +37,11 @@ extern int  irq_ts_save(void);
 extern void irq_ts_restore(int TS_state);
 
 /*
+ * Set up the userspace FPU context before returning to userspace.
+ */
+extern void switch_fpu_return(void);
+
+/*
  * Query the presence of one or more xfeatures. Works on any legacy CPU as 
well.
  *
  * If 'feature_name' is set then put a human-readable description of
diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 79e1cee9f3b0..b5accb35e434 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * High level FPU state handling functions:
@@ -576,13 +577,17 @@ static inline void fpregs_deactivate(struct fpu *fpu)
 /*
  * FPU state switching for scheduling.
  *
- * This is a two-stage process:
+ * This is a three-stage process:
  *
  *  - switch_fpu_prepare() saves the old state.
  *This is done within the context of the old process.
  *
- *  - switch_fpu_finish() restores the new state
- *and flips CR0.TS as necessary.
+ *  - switch_fpu_finish() sets TIF_LOAD_CPU, causing FPU state to
+ *be loaded when the new process returns to userspace.
+ *This is done with current_task pointing to the new process.
+ *
+ *  - switch_fpu_return() restores the new state and flips CR0.TS as
+ *necessary. This only runs if the process returns to userspace.
  */
 static inline void
 switch_fpu_prepare(struct fpu *old_fpu, int cpu)
@@ -605,38 +610,9 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu)
 /*
  * Misc helper functions:
  */
-
-/*
- * Set up the userspace FPU context for the new task.
- */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(void)
 {
-   bool preload;
-   /*
-* If the task has used the math, pre-load the FPU on xsave processors
-* or if the past 5 consecutive context-switches used math.
-*/
-   preload = static_cpu_has(X86_FEATURE_FPU) &&
- new_fpu->fpstate_active &&
- (use_eager_fpu() || new_fpu->counter > 5);
-
-   if (preload) {
-   prefetch(_fpu->state);
-   new_fpu->counter++;
-   __fpregs_activate(new_fpu);
-   trace_x86_fpu_regs_activated(new_fpu);
-
-   /* Don't change CR0.TS if we just switch! */
-   if (!__this_cpu_read(fpu_active)) {
-   __fpregs_activate_hw();
-   __this_cpu_write(fpu_active, true);
-   }
-
-   copy_kernel_to_fpregs(_fpu->state);
-   } else if (__this_cpu_read(fpu_active)) {
-   __this_cpu_write(fpu_active, false);
-   __fpregs_deactivate_hw();
-   }
+   set_thread_flag(TIF_LOAD_FPU);
 }
 
 /*
diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 8b7c8d8e0852..401e9c3e6039 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -106,6 +106,7 @@ struct thread_info {