Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-19 Thread Mathieu Desnoyers
- On Sep 19, 2019, at 12:26 PM, Will Deacon w...@kernel.org wrote:

[...]
>> 
>> The current wording from membarrier(2) is:
>> 
>>   The  "expedited" commands complete faster than the 
>> non-expedited
>>   ones; they never block, but have the downside of  causing  
>> extra
>>   overhead.
>> 
>> We could simply remove the "; they never block" part then ?
> 
> I think so, yes. That or, "; they do not voluntarily block" or something
> like that. Maybe look at other man pages for inspiration ;)

OK, let's tackle the man-pages part after the fix reaches mainline though.
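
(When we get to it, the man-pages change might be as small as a hunk along
these lines; this is hypothetical, I have not checked the actual groff
source of man2/membarrier.2:)

--- a/man2/membarrier.2
+++ b/man2/membarrier.2
@@
 The "expedited" commands complete faster than the non-expedited ones;
-they never block, but have the downside of causing extra overhead.
+they do not voluntarily block, but have the downside of causing extra
+overhead.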

[...]
> 
> I reckon you'll be fine using GFP_KERNEL and returning -ENOMEM on allocation
> failure. This shouldn't happen in practice and it removes the fallback
> path.

Works for me! I'll prepare an updated patchset.
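
For the record, the allocation path in the updated series should boil down
to something like the following (a sketch of the direction only, not the
final code):

static int sync_runqueues_membarrier_state(struct mm_struct *mm)
{
        int membarrier_state = atomic_read(&mm->membarrier_state);
        cpumask_var_t tmpmask;
        int cpu;

        if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
                this_cpu_write(runqueues.membarrier_state, membarrier_state);
                return 0;
        }

        /* GFP_KERNEL may sleep; callers propagate -ENOMEM to user-space. */
        if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
                return -ENOMEM;

        /* ... IPI the runqueues currently running @mm, as in the earlier patch ... */

        free_cpumask_var(tmpmask);
        return 0;
}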

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-19 Thread Will Deacon
Hi Mathieu,

Sorry for the delay in responding.

On Fri, Sep 13, 2019 at 10:22:28AM -0400, Mathieu Desnoyers wrote:
> - On Sep 12, 2019, at 11:47 AM, Will Deacon w...@kernel.org wrote:
> 
> > On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote:
> >> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon  wrote:
> >> >
> >> > So the man page for sys_membarrier states that the expedited variants
> >> > "never block", which feels pretty strong. Do any other system calls claim to
> >> > provide this guarantee without a failure path if blocking is necessary?
> >> 
> >> The traditional semantics for "we don't block" is that "we block on
> >> memory allocations and locking and user accesses etc, but  we don't
> >> wait for our own IO".
> >> 
> >> So there may be new IO started (and waited on) as part of allocating
> >> new memory etc, or in just paging in user memory, but the IO that the
> >> operation _itself_ explicitly starts is not waited on.
> > 
> > Thanks, that makes sense, and I'd be inclined to suggest an update to the
> > sys_membarrier manpage to make this more clear since the "never blocks"
> > phrasing doesn't seem to be used like this for other system calls.
> 
> The current wording from membarrier(2) is:
> 
>   The  "expedited" commands complete faster than the non-expedited
>   ones; they never block, but have the downside of  causing  extra
>   overhead.
> 
> We could simply remove the "; they never block" part then ?

I think so, yes. That or, "; they do not voluntarily block" or something
like that. Maybe look at other man pages for inspiration ;)

> >> No system call should ever be considered "atomic" in any sense. If
> >> you're doing RT, you should maybe expect "getpid()" to not ever block,
> >> but that's just about the exclusive list of truly nonblocking system
> >> calls, and even that can be preempted.
> > 
> > In which case, why can't we just use GFP_KERNEL for the cpumask allocation
> > instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu?
> 
> Looking at:
> 
> #define GFP_KERNEL  (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> 
> I notice that it does not include __GFP_NOFAIL. What prevents GFP_KERNEL from
> failing, and where is this guarantee documented ?

There was an lwn article a little while ago about this:

https://lwn.net/Articles/723317/

I'm not sure what (if anything) has changed in this regard since then,
however.

> Regarding __GFP_NOFAIL, its use seems to be discouraged in linux/gfp.h:
> 
>  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures. The allocation could block
>  * indefinitely but will never return with failure. Testing for
>  * failure is pointless.
>  * New users should be evaluated carefully (and the flag should be
>  * used only when there is no reasonable failure policy) but it is
>  * definitely preferable to use the flag rather than opencode endless
>  * loop around allocator.
>  * Using this flag for costly allocations is _highly_ discouraged.
> 
> So I am reluctant to use it.
> 
> But if we can agree on the right combination of flags that guarantees there
> is no failure, I would be perfectly fine with using them to remove the fallback
> code.

I reckon you'll be fine using GFP_KERNEL and returning -ENOMEM on allocation
failure. This shouldn't happen in practice and it removes the fallback
path.

Will


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-13 Thread Mathieu Desnoyers



- On Sep 13, 2019, at 12:04 PM, Oleg Nesterov o...@redhat.com wrote:

> On 09/13, Mathieu Desnoyers wrote:
>>
>> membarrier_exec_mmap(), which seems to be affected by the same problem.
> 
> IIRC, in the last version it is called by exec_mmap() under task_lock(),
> so it should be fine.

Fair point. Although it seems rather cleaner to use this_cpu_write() in
all 3 sites updating this variable rather than a mix of this_cpu_write
and WRITE_ONCE(), unless anyone objects.
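
Concretely, the single-user/single-CPU fast path would go from:

        WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);

to something like the following (a sketch; runqueues being the per-cpu
struct rq variable):

        /*
         * this_cpu_write() folds the per-cpu address computation and the
         * store into one preemption-safe operation, so a preemptible
         * caller cannot compute this_rq() on one CPU and then store on
         * another.
         */
        this_cpu_write(runqueues.membarrier_state, membarrier_state);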

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-13 Thread Oleg Nesterov
On 09/13, Mathieu Desnoyers wrote:
>
> membarrier_exec_mmap(), which seems to be affected by the same problem.

IIRC, in the last version it is called by exec_mmap() under task_lock(),
so it should be fine.

Oleg.



Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-13 Thread Mathieu Desnoyers
- On Sep 9, 2019, at 7:00 AM, Oleg Nesterov o...@redhat.com wrote:

> On 09/08, Mathieu Desnoyers wrote:
>>
>> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
>> +{
>> +int membarrier_state = atomic_read(&mm->membarrier_state);
>> +bool fallback = false;
>> +cpumask_var_t tmpmask;
>> +int cpu;
>> +
>> +if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
>> +WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);
> 
> This doesn't look safe: this caller can migrate to another CPU after
> it calculates the per-cpu ptr.
> 
> I think you need to disable preemption or simply use this_cpu_write().

Good point! I'll use this_cpu_write() there and within
membarrier_exec_mmap(), which seems to be affected by the same problem.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-13 Thread Mathieu Desnoyers
- On Sep 12, 2019, at 11:47 AM, Will Deacon w...@kernel.org wrote:

> On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote:
>> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon  wrote:
>> >
>> > So the man page for sys_membarrier states that the expedited variants
>> > "never block", which feels pretty strong. Do any other system calls claim to
>> > provide this guarantee without a failure path if blocking is necessary?
>> 
>> The traditional semantics for "we don't block" is that "we block on
>> memory allocations and locking and user accesses etc, but  we don't
>> wait for our own IO".
>> 
>> So there may be new IO started (and waited on) as part of allocating
>> new memory etc, or in just paging in user memory, but the IO that the
>> operation _itself_ explicitly starts is not waited on.
> 
> Thanks, that makes sense, and I'd be inclined to suggest an update to the
> sys_membarrier manpage to make this more clear since the "never blocks"
> phrasing doesn't seem to be used like this for other system calls.

The current wording from membarrier(2) is:

  The  "expedited" commands complete faster than the non-expedited
  ones; they never block, but have the downside of  causing  extra
  overhead.

We could simply remove the "; they never block" part then ?

> 
>> No system call should ever be considered "atomic" in any sense. If
>> you're doing RT, you should maybe expect "getpid()" to not ever block,
>> but that's just about the exclusive list of truly nonblocking system
>> calls, and even that can be preempted.
> 
> In which case, why can't we just use GFP_KERNEL for the cpumask allocation
> instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu?

Looking at:

#define GFP_KERNEL  (__GFP_RECLAIM | __GFP_IO | __GFP_FS)

I notice that it does not include __GFP_NOFAIL. What prevents GFP_KERNEL from
failing, and where is this guarantee documented ?

Regarding __GFP_NOFAIL, its use seems to be discouraged in linux/gfp.h:

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.
 * New users should be evaluated carefully (and the flag should be
 * used only when there is no reasonable failure policy) but it is
 * definitely preferable to use the flag rather than opencode endless
 * loop around allocator.
 * Using this flag for costly allocations is _highly_ discouraged.

So I am reluctant to use it.

But if we can agree on the right combination of flags that guarantees there
is no failure, I would be perfectly fine with using them to remove the fallback
code.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-12 Thread Will Deacon
On Thu, Sep 12, 2019 at 03:24:35PM +0100, Linus Torvalds wrote:
> On Thu, Sep 12, 2019 at 2:48 PM Will Deacon  wrote:
> >
> > So the man page for sys_membarrier states that the expedited variants "never
> > block", which feels pretty strong. Do any other system calls claim to
> > provide this guarantee without a failure path if blocking is necessary?
> 
> The traditional semantics for "we don't block" is that "we block on
> memory allocations and locking and user accesses etc, but  we don't
> wait for our own IO".
> 
> So there may be new IO started (and waited on) as part of allocating
> new memory etc, or in just paging in user memory, but the IO that the
> operation _itself_ explicitly starts is not waited on.

Thanks, that makes sense, and I'd be inclined to suggest an update to the
sys_membarrier manpage to make this more clear since the "never blocks"
phrasing doesn't seem to be used like this for other system calls.

> No system call should ever be considered "atomic" in any sense. If
> you're doing RT, you should maybe expect "getpid()" to not ever block,
> but that's just about the exclusive list of truly nonblocking system
> calls, and even that can be preempted.

In which case, why can't we just use GFP_KERNEL for the cpumask allocation
instead of GFP_NOWAIT and then remove the failure path altogether? Mathieu?

Will


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-12 Thread Linus Torvalds
On Thu, Sep 12, 2019 at 2:48 PM Will Deacon  wrote:
>
> So the man page for sys_membarrier states that the expedited variants "never
> block", which feels pretty strong. Do any other system calls claim to
> provide this guarantee without a failure path if blocking is necessary?

The traditional semantics for "we don't block" is that "we block on
memory allocations and locking and user accesses etc, but  we don't
wait for our own IO".

So there may be new IO started (and waited on) as part of allocating
new memory etc, or in just paging in user memory, but the IO that the
operation _itself_ explicitly starts is not waited on.

No system call should ever be considered "atomic" in any sense. If
you're doing RT, you should maybe expect "getpid()" to not ever block,
but that's just about the exclusive list of truly nonblocking system
calls, and even that can be preempted.

Linus


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-12 Thread Will Deacon
On Tue, Sep 10, 2019 at 05:48:02AM -0400, Mathieu Desnoyers wrote:
> - On Sep 8, 2019, at 5:51 PM, Linus Torvalds torva...@linux-foundation.org wrote:
> 
> > On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers
> >  wrote:
> >>
> >> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
> >> +{
> >> +   int membarrier_state = atomic_read(&mm->membarrier_state);
> >> +   bool fallback = false;
> >> +   cpumask_var_t tmpmask;
> >> +
> >> +   if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> >> +   /* Fallback for OOM. */
> >> +   fallback = true;
> >> +   }
> >> +
> >> +   /*
> >> +* For each cpu runqueue, if the task's mm match @mm, ensure that 
> >> all
> >> +* @mm's membarrier state set bits are also set in in the 
> >> runqueue's
> >> +* membarrier state. This ensures that a runqueue scheduling
> >> +* between threads which are users of @mm has its membarrier state
> >> +* updated.
> >> +*/
> >> +   cpus_read_lock();
> >> +   rcu_read_lock();
> >> +   for_each_online_cpu(cpu) {
> >> +   struct rq *rq = cpu_rq(cpu);
> >> +   struct task_struct *p;
> >> +
> >> +   p = task_rcu_dereference(&rq->curr);
> >> +   if (p && p->mm == mm) {
> >> +   if (!fallback)
> >> +   __cpumask_set_cpu(cpu, tmpmask);
> >> +   else
> >> +   smp_call_function_single(cpu, ipi_sync_rq_state,
> >> +mm, 1);
> >> +   }
> >> +   }
> > 
> > I really absolutely detest this whole "fallback" code.
> > 
> > It will never get any real testing, and the code is just broken.
> > 
> > Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it
> > will possibly call too many CPU's, but this fallback code is just
> > completely disgusting.
> > 
> > Do a simple and clean implementation. Then, if you can show real
> > performance issues (which I doubt), maybe do something else, but even
> > then you should never do something that will effectively create cases
> > that have absolutely zero test-coverage.
> 
> A few points worth mentioning here:
> 
> 1) As I stated earlier, using mm_cpumask in its current form is not
>an option for membarrier. For two reasons:
> 
>A) The mask is not populated on all architectures (e.g. arm64 does
>   not populate it),
> 
>B) Even if it was populated on all architectures, we would need to
>   carefully audit and document every spot where this mm_cpumask
>   is set or cleared within each architecture code, and ensure we
>   have the required memory barriers between user-space memory
>   accesses and those stores, documenting those requirements into
>   each architecture code in the process. This seems to be a lot of
>   useless error-prone code churn.
> 
> 2) I should actually use GFP_KERNEL rather than GFP_NOWAIT in this
>membarrier registration code. But it can still fail. However, the other
>membarrier code using the same fallback pattern (private and global
>expedited) documents that those membarrier commands do not block in
>the membarrier(2) man page, so GFP_NOWAIT is appropriate in those cases.
> 
> 3) Testing-wise, I fully agree with your argument of lacking test coverage.
>One option I'm considering would be to add a selftest based on the
>fault-injection infrastructure, which would ensure that we have coverage
>of the failure case in the kernel selftests.
> 
> Thoughts ?

So the man page for sys_membarrier states that the expedited variants "never
block", which feels pretty strong. Do any other system calls claim to
provide this guarantee without a failure path if blocking is necessary?
Given that the whole thing is preemptible, I'm also curious as to how
exactly userspace relies on this non-blocking guarantee.  I'd have thought
that you could either just bite the bullet and block in the rare case that
you need to when allocating the cpumask, or you could just return
-EWOULDBLOCK on allocation failure, given that I suspect there are very few
users of this system call right now and it's not yet supported by glibc
afaik.

Will


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-10 Thread Mathieu Desnoyers
- On Sep 8, 2019, at 5:51 PM, Linus Torvalds torva...@linux-foundation.org wrote:

> On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers
>  wrote:
>>
>> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
>> +{
>> +   int membarrier_state = atomic_read(&mm->membarrier_state);
>> +   bool fallback = false;
>> +   cpumask_var_t tmpmask;
>> +
>> +   if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> +   /* Fallback for OOM. */
>> +   fallback = true;
>> +   }
>> +
>> +   /*
>> +* For each cpu runqueue, if the task's mm matches @mm, ensure that all
>> +* @mm's membarrier state set bits are also set in the runqueue's
>> +* membarrier state. This ensures that a runqueue scheduling
>> +* between threads which are users of @mm has its membarrier state
>> +* updated.
>> +*/
>> +   cpus_read_lock();
>> +   rcu_read_lock();
>> +   for_each_online_cpu(cpu) {
>> +   struct rq *rq = cpu_rq(cpu);
>> +   struct task_struct *p;
>> +
>> +   p = task_rcu_dereference(&rq->curr);
>> +   if (p && p->mm == mm) {
>> +   if (!fallback)
>> +   __cpumask_set_cpu(cpu, tmpmask);
>> +   else
>> +   smp_call_function_single(cpu, ipi_sync_rq_state,
>> +mm, 1);
>> +   }
>> +   }
> 
> I really absolutely detest this whole "fallback" code.
> 
> It will never get any real testing, and the code is just broken.
> 
> Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it
> will possibly call too many CPU's, but this fallback code is just
> completely disgusting.
> 
> Do a simple and clean implementation. Then, if you can show real
> performance issues (which I doubt), maybe do something else, but even
> then you should never do something that will effectively create cases
> that have absolutely zero test-coverage.

A few points worth mentioning here:

1) As I stated earlier, using mm_cpumask in its current form is not
   an option for membarrier. For two reasons:

   A) The mask is not populated on all architectures (e.g. arm64 does
  not populate it),

   B) Even if it was populated on all architectures, we would need to
  carefully audit and document every spot where this mm_cpumask
  is set or cleared within each architecture code, and ensure we
  have the required memory barriers between user-space memory
  accesses and those stores, documenting those requirements into
  each architecture code in the process. This seems to be a lot of
  useless error-prone code churn.

2) I should actually use GFP_KERNEL rather than GFP_NOWAIT in this
   membarrier registration code. But it can still fail. However, the other
   membarrier code using the same fallback pattern (private and global
   expedited) documents that those membarrier commands do not block in
   the membarrier(2) man page, so GFP_NOWAIT is appropriate in those cases.

3) Testing-wise, I fully agree with your argument of lacking test coverage.
   One option I'm considering would be to add a selftest based on the
   fault-injection infrastructure, which would ensure that we have coverage
   of the failure case in the kernel selftests.

Thoughts ?
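
To make point 3 more concrete, the selftest could be little more than the
following (a rough sketch only: it assumes CONFIG_FAULT_INJECTION /
CONFIG_FAILSLAB and the failslab debugfs knobs from
Documentation/fault-injection, and the exact knob settings would need
tuning):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/membarrier.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(val, f);
                fclose(f);
        }
}

int main(void)
{
        int ret;

        /* Make the next slab allocation performed by this task fail. */
        write_str("/sys/kernel/debug/failslab/task-filter", "Y");
        write_str("/sys/kernel/debug/failslab/probability", "100");
        write_str("/sys/kernel/debug/failslab/times", "1");
        /* Also allow failing sleepable (GFP_KERNEL) allocations. */
        write_str("/sys/kernel/debug/failslab/ignore-gfp-wait", "N");
        write_str("/proc/self/make-it-fail", "1");

        ret = syscall(__NR_membarrier,
                      MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);

        write_str("/proc/self/make-it-fail", "0");

        /* Expect a clean failure (e.g. -ENOMEM), not a crash or a hang. */
        printf("registration returned %d (%s)\n",
               ret, ret ? strerror(errno) : "ok");
        return 0;
}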

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-09 Thread Oleg Nesterov
On 09/08, Mathieu Desnoyers wrote:
>
> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
> +{
> + int membarrier_state = atomic_read(&mm->membarrier_state);
> + bool fallback = false;
> + cpumask_var_t tmpmask;
> + int cpu;
> +
> + if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) {
> + WRITE_ONCE(this_rq()->membarrier_state, membarrier_state);

This doesn't look safe: this caller can migrate to another CPU after
it calculates the per-cpu ptr.

I think you need to disable preemption or simply use this_cpu_write().

Oleg.



Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-08 Thread Linus Torvalds
On Sun, Sep 8, 2019 at 6:49 AM Mathieu Desnoyers
 wrote:
>
> +static void sync_runqueues_membarrier_state(struct mm_struct *mm)
> +{
> +   int membarrier_state = atomic_read(&mm->membarrier_state);
> +   bool fallback = false;
> +   cpumask_var_t tmpmask;
> +
> +   if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +   /* Fallback for OOM. */
> +   fallback = true;
> +   }
> +
> +   /*
> +* For each cpu runqueue, if the task's mm matches @mm, ensure that all
> +* @mm's membarrier state set bits are also set in the runqueue's
> +* membarrier state. This ensures that a runqueue scheduling
> +* between threads which are users of @mm has its membarrier state
> +* updated.
> +*/
> +   cpus_read_lock();
> +   rcu_read_lock();
> +   for_each_online_cpu(cpu) {
> +   struct rq *rq = cpu_rq(cpu);
> +   struct task_struct *p;
> +
> +   p = task_rcu_dereference(&rq->curr);
> +   if (p && p->mm == mm) {
> +   if (!fallback)
> +   __cpumask_set_cpu(cpu, tmpmask);
> +   else
> +   smp_call_function_single(cpu, ipi_sync_rq_state,
> +mm, 1);
> +   }
> +   }

I really absolutely detest this whole "fallback" code.

It will never get any real testing, and the code is just broken.

Why don't you just use the mm_cpumask(mm) unconditionally? Yes, it
will possibly call too many CPU's, but this fallback code is just
completely disgusting.

Do a simple and clean implementation. Then, if you can show real
performance issues (which I doubt), maybe do something else, but even
then you should never do something that will effectively create cases
that have absolutely zero test-coverage.

  Linus


[RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state racy load (v2)

2019-09-08 Thread Mathieu Desnoyers
The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when accessed by the membarrier system call's
iteration over runqueues, performed without holding the runqueue locks.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without switching to
another mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue whose current
task is running the mm whose state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clearing the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.
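
For readers skimming the diff below (truncated in this archive), the
scheduler-side hook described above amounts to roughly the following
(a sketch, not the literal kernel/sched/sched.h hunk):

static inline void membarrier_switch_mm(struct rq *rq,
                                        struct mm_struct *prev_mm,
                                        struct mm_struct *next_mm)
{
        int membarrier_state;

        if (prev_mm == next_mm)
                return;

        membarrier_state = atomic_read(&next_mm->membarrier_state);
        if (READ_ONCE(rq->membarrier_state) == membarrier_state)
                return;

        WRITE_ONCE(rq->membarrier_state, membarrier_state);
}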

Suggested-by: Linus Torvalds 
Signed-off-by: Mathieu Desnoyers 
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Linus Torvalds 
Cc: Russell King - ARM Linux admin 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Kirill Tkhai 
Cc: Mike Galbraith 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
---
Changes since v1:
- Take care of Peter Zijlstra's feedback, moving callsites closer
  to switch_mm() (scheduler) and activate_mm() (execve).
- Add memory barrier in membarrier_exec_mmap() prior to clearing
  the membarrier state.
---
 fs/exec.c |   2 +-
 include/linux/mm_types.h  |  14 +++-
 include/linux/sched/mm.h  |   8 +-
 kernel/sched/core.c   |   4 +-
 kernel/sched/membarrier.c | 170 --
 kernel/sched/sched.h  |  34 
 6 files changed, 180 insertions(+), 52 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index f7f6a140856a..ed39e2c81338 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1035,6 +1035,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->mm = mm;
tsk->active_mm = mm;
+   membarrier_exec_mmap(mm);
activate_mm(active_mm, mm);
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
@@ -1825,7 +1826,6 @@ static int __do_execve_file(int fd, struct filename *filename,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
-   membarrier_execve(current);
rseq_execve(current);
acct_update_integrals(current);
task_numa_free(current, false);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..ec9bd3a6c827 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -383,6 +383,16 @@ struct mm_struct {
unsigned long highest_vm_end;   /* highest vma end address */
pgd_t * pgd;
 
+#ifdef CONFIG_MEMBARRIER
+   /**
+* @membarrier_state: Flags controlling membarrier behavior.
+*
+* This field is close to @pgd to hopefully fit in the same
+* cache-line, which needs to be touched by switch_mm().
+*/
+   atomic_t membarrier_state;
+#endif
+
/**
 * @mm_users: The number of users including userspace.
 *
@@ -452,9 +462,7 @@ struct mm_struct {
unsigned long flags; /* Must use atomic bitops to access */
 
struct core_state *core_state; /* coredumping support */
-#ifdef CONFIG_MEMBARRIER
-   atomic_t membarrier_state;
-#endif
+
 #ifdef CONFIG_AIO
spinlock_t  ioctx_lock;
struct kioctx_table __rcu   *ioctx_table;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8557ec664213..e6770012db18 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -370,10 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
sync_core_before_usermode();
 }
 
-static inline void membarrier_execve(struct task_struct *t)
-{
-   atomic_set(&t->mm->membarrier_state, 0);
-}
+extern void membarrier_exec_mmap(struct mm_struct *mm);
+
 #else
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
@@ -382,7 +380,7 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 {
 }
 #endif
-static inline void membarrier_execve(struct