Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Mathieu Desnoyers
- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
>> - On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org 
>> wrote:
>> 
>> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
>> > 
>> >> That being said, the x86 sync core gap that I imagined could be fixed
>> >> by changing to rq->curr == rq->idle test does not actually exist because
>> >> the global membarrier does not have a sync core option. So fixing the
>> >> exit_lazy_tlb points that this series does *should* fix that. So
>> >> PF_KTHREAD may be less problematic than I thought from implementation
>> >> point of view, only semantics.
>> > 
>> > So I've been trying to figure out where that PF_KTHREAD comes from,
>> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
>> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
>> > 
>> > So the first version:
>> > 
>> >  
>> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com
>> > 
>> > appears to unconditionally send the IPI and checks p->mm in the IPI
>> > context, but then v2:
>> > 
>> >  
>> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com
>> > 
>> > has the current code. But I've been unable to find the reason the
>> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
>> 
>> Looking back at my inbox, it seems like you are the one who proposed to
>> skip all kthreads:
>> 
>> https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net
> 
> I had a feeling it might've been me ;-) I just couldn't find the email.
> 
>> > The comment doesn't really help either; sure we have the whole lazy mm
>> > thing, but that's ->active_mm, not ->mm.
>> > 
>> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
>> > make the remote p->mm test work? Or were we over-eager with the !p->mm
>> > doesn't imply kthread 'cleanups' at the time?
>> 
>> The nice thing about adding back kthreads to the threads considered for
>> membarrier IPI is that it has no observable effect on the user-space ABI.
>> No pre-existing kthreads rely on this, and we just provide an additional
>> guarantee for future kthread implementations.
>> 
>> > Also, I just realized, I still have a fix for use_mm() now
>> > kthread_use_mm() that seems to have been lost.
>> 
>> I suspect we need to at least document the memory barriers in kthread_use_mm 
>> and
>> kthread_unuse_mm to state that they are required by membarrier if we want to
>> ipi kthreads as well.
> 
> Right, so going by that email you found it was mostly a case of being
> lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
> any other bits that might be needed, covering kthreads should be
> possible.
> 
> No objections from me for making it so.

I'm OK on making membarrier cover kthreads using mm as well, provided we
audit kthread_{,un}use_mm() to make sure the proper barriers are in place
after setting task->mm and before clearing it.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Peter Zijlstra
On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
> - On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org wrote:
> 
> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> > 
> >> That being said, the x86 sync core gap that I imagined could be fixed
> >> by changing to rq->curr == rq->idle test does not actually exist because
> >> the global membarrier does not have a sync core option. So fixing the
> >> exit_lazy_tlb points that this series does *should* fix that. So
> >> PF_KTHREAD may be less problematic than I thought from implementation
> >> point of view, only semantics.
> > 
> > So I've been trying to figure out where that PF_KTHREAD comes from,
> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> > 
> > So the first version:
> > 
> >  
> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com
> > 
> > appears to unconditionally send the IPI and checks p->mm in the IPI
> > context, but then v2:
> > 
> >  
> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com
> > 
> > has the current code. But I've been unable to find the reason the
> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
> 
> Looking back at my inbox, it seems like you are the one who proposed to
> skip all kthreads: 
> 
> https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net

I had a feeling it might've been me ;-) I just couldn't find the email.

> > The comment doesn't really help either; sure we have the whole lazy mm
> > thing, but that's ->active_mm, not ->mm.
> > 
> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
> > make the remote p->mm test work? Or were we over-eager with the !p->mm
> > doesn't imply kthread 'cleanups' at the time?
> 
> The nice thing about adding back kthreads to the threads considered for
> membarrier IPI is that it has no observable effect on the user-space ABI.
> No pre-existing kthreads rely on this, and we just provide an additional
> guarantee for future kthread implementations.
> 
> > Also, I just realized, I still have a fix for use_mm() now
> > kthread_use_mm() that seems to have been lost.
> 
> I suspect we need to at least document the memory barriers in kthread_use_mm 
> and
> kthread_unuse_mm to state that they are required by membarrier if we want to
> ipi kthreads as well.

Right, so going by that email you found it was mostly a case of being
lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
any other bits that might be needed, covering kthreads should be
possible.

No objections from me for making it so.


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Mathieu Desnoyers
- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> 
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
> 
> So I've been trying to figure out where that PF_KTHREAD comes from,
> commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> 
> So the first version:
> 
>  
> https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com
> 
> appears to unconditionally send the IPI and checks p->mm in the IPI
> context, but then v2:
> 
>  
> https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com
> 
> has the current code. But I've been unable to find the reason the
> 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

Looking back at my inbox, it seems like you are the one who proposed to
skip all kthreads: 

https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net

> 
> The comment doesn't really help either; sure we have the whole lazy mm
> thing, but that's ->active_mm, not ->mm.
> 
> Possibly it is because {,un}use_mm() do not have sufficient barriers to
> make the remote p->mm test work? Or were we over-eager with the !p->mm
> doesn't imply kthread 'cleanups' at the time?

The nice thing about adding back kthreads to the threads considered for
membarrier IPI is that it has no observable effect on the user-space ABI.
No pre-existing kthreads rely on this, and we just provide an additional
guarantee for future kthread implementations.

> Also, I just realized, I still have a fix for use_mm() now
> kthread_use_mm() that seems to have been lost.

I suspect we need to at least document the memory barriers in kthread_use_mm and
kthread_unuse_mm to state that they are required by membarrier if we want to
ipi kthreads as well.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread peterz
On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:

> That being said, the x86 sync core gap that I imagined could be fixed 
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

So I've been trying to figure out where that PF_KTHREAD comes from,
commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.

So the first version:

  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com

appears to unconditionally send the IPI and checks p->mm in the IPI
context, but then v2:

  
https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com

has the current code. But I've been unable to find the reason the
'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

The comment doesn't really help either; sure we have the whole lazy mm
thing, but that's ->active_mm, not ->mm.

Possibly it is because {,un}use_mm() do not have sufficient barriers to
make the remote p->mm test work? Or were we over-eager with the !p->mm
doesn't imply kthread 'cleanups' at the time?

Also, I just realized, I still have a fix for use_mm() now
kthread_use_mm() that seems to have been lost.




Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Nicholas Piggin
Excerpts from Mathieu Desnoyers's message of July 21, 2020 11:11 pm:
> - On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npig...@gmail.com wrote:
> 
>> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
> [...]
>> 
>> Yeah you're probably right in this case I think. Quite likely most kernel
>> tasks that asynchronously write to user memory would at least have some
>> kind of producer-consumer barriers.
>> 
>> But is that restriction of all async modifications documented and enforced
>> anywhere?
>> 
>>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>>> still ordering requirement there for membarrier,
>>> 
>>> Please provide an example case with memory accesses via kthread_use_mm where
>>> ordering matters to support your concern.
>> 
>> I think the concern Andy raised with io_uring was less a specific
>> problem he saw and more a general concern that we have these memory
>> accesses which are not synchronized with membarrier.
>> 
>>>> so I really think
>>>> it's a fragile interface with no real way for the user to know how
>>>> kernel threads may use its mm for any particular reason, so membarrier
>>>> should synchronize all possible kernel users as well.
>>> 
>>> I strongly doubt so, but perhaps something should be clarified in the
>>> documentation
>>> if you have that feeling.
>> 
>> I'd rather go the other way and say if you have reasoning or numbers for
>> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
>> then we could think about keeping this subtlety with appropriate
>> documentation added, otherwise we can just kill it and remove all doubt.
>> 
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
> 
> Today, the membarrier global expedited command explicitly skips kernel 
> threads,
> but it happens that membarrier private expedited considers those with the
> same mm as target for the IPI.
> 
> So we already implement a semantic which differs between private and global
> expedited membarriers.

Which is not a good thing.

> This can be explained in part by the fact that
> kthread_use_mm was introduced after 4.16, where the most recent membarrier
> commands were introduced. It seems that the effect on membarrier was not
> considered when kthread_use_mm was introduced.

No it was just renamed, it used to be called use_mm and has been in the 
kernel for ~ever.

That you hadn't considered this is actually weight for my point, which 
is that there's so much subtle behaviour that's easy to miss we're 
better off with simpler and fewer special cases until it's proven 
they're needed. Not the other way around.

> 
> Looking at membarrier(2) documentation, it states that IPIs are only sent to
> threads belonging to the same process as the calling thread. If my 
> understanding
> of the notion of process is correct, this should rule out sending the IPI to
> kernel threads, given they are not "part" of the same process, only borrowing
> the mm. But I agree that the distinction is moot, and should be clarified.

It does if you read it in a user-hostile legalistic way. The reality is 
userspace shouldn't and can't know about how the kernel might implement 
functionality.

> Without a clear use-case to justify adding a constraint on membarrier, I am
> tempted to simply clarify documentation of current membarrier commands,
> stating clearly that they are not guaranteed to affect kernel threads. Then,
> if we have a compelling use-case to implement a different behavior which 
> covers
> kthreads, this could be added consistently across membarrier commands with a
> flag (or by adding new commands).
> 
> Does this approach make sense ?

The other position is without a clear use case for PF_KTHREAD, seeing as 
async kernel accesses had not been considered before now, we limit the 
optimisation to only skipping the idle thread. I think that makes more 
sense (unless you have a reason for PF_KTHREAD but it doesn't seem like 
there is much of one).

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Mathieu Desnoyers
- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
[...]
> 
> Yeah you're probably right in this case I think. Quite likely most kernel
> tasks that asynchronously write to user memory would at least have some
> kind of producer-consumer barriers.
> 
> But is that restriction of all async modifications documented and enforced
> anywhere?
> 
>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>> still ordering requirement there for membarrier,
>> 
>> Please provide an example case with memory accesses via kthread_use_mm where
>> ordering matters to support your concern.
> 
> I think the concern Andy raised with io_uring was less a specific
> problem he saw and more a general concern that we have these memory
> accesses which are not synchronized with membarrier.
> 
>>> so I really think
>>> it's a fragile interface with no real way for the user to know how
>>> kernel threads may use its mm for any particular reason, so membarrier
>>> should synchronize all possible kernel users as well.
>> 
>> I strongly doubt so, but perhaps something should be clarified in the
>> documentation
>> if you have that feeling.
> 
> I'd rather go the other way and say if you have reasoning or numbers for
> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
> then we could think about keeping this subtlety with appropriate
> documentation added, otherwise we can just kill it and remove all doubt.
> 
> That being said, the x86 sync core gap that I imagined could be fixed
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

Today, the membarrier global expedited command explicitly skips kernel threads,
but it happens that membarrier private expedited considers those with the
same mm as target for the IPI.

So we already implement a semantic which differs between private and global
expedited membarriers. This can be explained in part by the fact that
kthread_use_mm was introduced after 4.16, where the most recent membarrier
commands were introduced. It seems that the effect on membarrier was not
considered when kthread_use_mm was introduced.

Looking at membarrier(2) documentation, it states that IPIs are only sent to
threads belonging to the same process as the calling thread. If my understanding
of the notion of process is correct, this should rule out sending the IPI to
kernel threads, given they are not "part" of the same process, only borrowing
the mm. But I agree that the distinction is moot, and should be clarified.
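
For concreteness, the user-space side of the existing commands looks roughly
like the sketch below (an illustrative sketch only, not code from this thread;
it assumes <linux/membarrier.h> and a kernel built with membarrier support):

/* Hedged sketch: register for and issue a private expedited membarrier
 * from user space; error handling kept minimal on purpose. */
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int membarrier(int cmd, unsigned int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
        /* A process registers once; the private expedited command then
         * IPIs only CPUs running threads that share this process's mm. */
        if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0)) {
                perror("membarrier register");
                return 1;
        }
        if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0)) {
                perror("membarrier");
                return 1;
        }
        printf("private expedited membarrier completed\n");
        return 0;
}

Whether CPUs currently running a kthread that has borrowed this mm are included
is exactly the point in question; the sketch does not depend on it either way.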

Without a clear use-case to justify adding a constraint on membarrier, I am
tempted to simply clarify documentation of current membarrier commands,
stating clearly that they are not guaranteed to affect kernel threads. Then,
if we have a compelling use-case to implement a different behavior which covers
kthreads, this could be added consistently across membarrier commands with a
flag (or by adding new commands).

Does this approach make sense ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Nicholas Piggin
Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
> - On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npig...@gmail.com wrote:
> 
>> Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
>>> - On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote:
>>> [...]
 
>>>> membarrier does replace barrier instructions on remote CPUs, which do
>>>> order accesses performed by the kernel on the user address space. So
>>>> membarrier should too I guess.
>>>> 
>>>> Normal process context accesses like read(2) will do so because they
>>>> don't get filtered out from IPIs, but kernel threads using the mm may
>>>> not.
>>> 
>>> But it should not be an issue, because membarrier's ordering is only with
>>> respect
>>> to submit and completion of io_uring requests, which are performed through
>>> system calls from the context of user-space threads, which are called from 
>>> the
>>> right mm.
>> 
>> Is that true? Can io completions be written into an address space via a
>> kernel thread? I don't know the io_uring code well but it looks like
>> that's asynchronously using the user mm context.
> 
> Indeed, the io completion appears to be signaled asynchronously between kernel
> and user-space.

Yep, many other places do similar with use_mm.

[snip]

> So as far as membarrier memory ordering dependencies are concerned, it relies
> on the store-release/load-acquire dependency chain in the completion queue to
> order against anything that was done prior to the completed requests.
> 
> What is in-flight while the requests are being serviced provides no memory
> ordering guarantee whatsoever.

Yeah you're probably right in this case I think. Quite likely most kernel 
tasks that asynchronously write to user memory would at least have some 
kind of producer-consumer barriers.

But is that restriction of all async modifications documented and enforced
anywhere?

>> How about other memory accesses via kthread_use_mm? Presumably there is
>> still ordering requirement there for membarrier,
> 
> Please provide an example case with memory accesses via kthread_use_mm where
> ordering matters to support your concern.

I think the concern Andy raised with io_uring was less a specific 
problem he saw and more a general concern that we have these memory 
accesses which are not synchronized with membarrier.

>> so I really think
>> it's a fragile interface with no real way for the user to know how
>> kernel threads may use its mm for any particular reason, so membarrier
>> should synchronize all possible kernel users as well.
> 
> I strongly doubt so, but perhaps something should be clarified in the 
> documentation
> if you have that feeling.

I'd rather go the other way and say if you have reasoning or numbers for 
why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
then we could think about keeping this subtlety with appropriate 
documentation added, otherwise we can just kill it and remove all doubt.

That being said, the x86 sync core gap that I imagined could be fixed 
by changing to rq->curr == rq->idle test does not actually exist because
the global membarrier does not have a sync core option. So fixing the
exit_lazy_tlb points that this series does *should* fix that. So
PF_KTHREAD may be less problematic than I thought from implementation
point of view, only semantics.

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-20 Thread Mathieu Desnoyers
- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
>> - On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote:
>> [...]
>>> 
>>> membarrier does replace barrier instructions on remote CPUs, which do
>>> order accesses performed by the kernel on the user address space. So
>>> membarrier should too I guess.
>>> 
>>> Normal process context accesses like read(2) will do so because they
>>> don't get filtered out from IPIs, but kernel threads using the mm may
>>> not.
>> 
>> But it should not be an issue, because membarrier's ordering is only with
>> respect
>> to submit and completion of io_uring requests, which are performed through
>> system calls from the context of user-space threads, which are called from 
>> the
>> right mm.
> 
> Is that true? Can io completions be written into an address space via a
> kernel thread? I don't know the io_uring code well but it looks like
> that's asynchronously using the user mm context.

Indeed, the io completion appears to be signaled asynchronously between kernel
and user-space. Therefore, both kernel and userspace code need to have proper
memory barriers in place to signal completion, otherwise user-space could read
garbage after it notices completion of a read.

I did not review the entire io_uring implementation, but the publish side
for completion appears to be:

static void __io_commit_cqring(struct io_ring_ctx *ctx)
{
        struct io_rings *rings = ctx->rings;

        /* order cqe stores with ring update */
        smp_store_release(&rings->cq.tail, ctx->cached_cq_tail);

        if (wq_has_sleeper(&ctx->cq_wait)) {
                wake_up_interruptible(&ctx->cq_wait);
                kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
        }
}

The store-release on tail should be paired with a load_acquire on the
reader-side (it's called "read_barrier()" in the code):

tools/io_uring/queue.c:

static int __io_uring_get_cqe(struct io_uring *ring,
                              struct io_uring_cqe **cqe_ptr, int wait)
{
        struct io_uring_cq *cq = &ring->cq;
        const unsigned mask = *cq->kring_mask;
        unsigned head;
        int ret;

        *cqe_ptr = NULL;
        head = *cq->khead;
        do {
                /*
                 * It's necessary to use a read_barrier() before reading
                 * the CQ tail, since the kernel updates it locklessly. The
                 * kernel has the matching store barrier for the update. The
                 * kernel also ensures that previous stores to CQEs are ordered
                 * with the tail update.
                 */
                read_barrier();
                if (head != *cq->ktail) {
                        *cqe_ptr = &cq->cqes[head & mask];
                        break;
                }
                if (!wait)
                        break;
                ret = io_uring_enter(ring->ring_fd, 0, 1,
                                     IORING_ENTER_GETEVENTS, NULL);
                if (ret < 0)
                        return -errno;
        } while (1);

        return 0;
}

So as far as membarrier memory ordering dependencies are concerned, it relies
on the store-release/load-acquire dependency chain in the completion queue to
order against anything that was done prior to the completed requests.

What is in-flight while the requests are being serviced provides no memory
ordering guarantee whatsoever.
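
The guarantee being relied on is just that release/acquire pairing. A minimal
user-space C11 analogue of the publish/consume chain (illustrative only, not
the io_uring code; the variable names are made up for the example):

#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static int cqe_data;                    /* payload slot (stand-in for a CQE) */
static atomic_uint tail;                /* published tail index */

static void *producer(void *arg)
{
        cqe_data = 42;                                  /* fill the entry...  */
        atomic_store_explicit(&tail, 1,
                              memory_order_release);    /* ...then publish it */
        return NULL;
}

static void *consumer(void *arg)
{
        /* The acquire load pairs with the release store: once the new tail
         * is observed, the entry contents are guaranteed to be visible. */
        while (atomic_load_explicit(&tail, memory_order_acquire) == 0)
                ;
        assert(cqe_data == 42);
        return NULL;
}

int main(void)
{
        pthread_t p, c;

        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        printf("consumer observed the published entry\n");
        return 0;
}

Anything done to the address space *before* the entry is filled is ordered by
the same chain; anything still in flight is not, which is the point above.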

> How about other memory accesses via kthread_use_mm? Presumably there is
> still ordering requirement there for membarrier,

Please provide an example case with memory accesses via kthread_use_mm where
ordering matters to support your concern.

> so I really think
> it's a fragile interface with no real way for the user to know how
> kernel threads may use its mm for any particular reason, so membarrier
> should synchronize all possible kernel users as well.

I strongly doubt so, but perhaps something should be clarified in the 
documentation
if you have that feeling.

Thanks,

Mathieu

> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-19 Thread Nicholas Piggin
Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
> - On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote:
> [...]
>> 
>> membarrier does replace barrier instructions on remote CPUs, which do
>> order accesses performed by the kernel on the user address space. So
>> membarrier should too I guess.
>> 
>> Normal process context accesses like read(2) will do so because they
>> don't get filtered out from IPIs, but kernel threads using the mm may
>> not.
> 
> But it should not be an issue, because membarrier's ordering is only with 
> respect
> to submit and completion of io_uring requests, which are performed through
> system calls from the context of user-space threads, which are called from the
> right mm.

Is that true? Can io completions be written into an address space via a
kernel thread? I don't know the io_uring code well but it looks like 
that's asynchronously using the user mm context.

How about other memory accesses via kthread_use_mm? Presumably there is 
still ordering requirement there for membarrier, so I really think
it's a fragile interface with no real way for the user to know how 
kernel threads may use its mm for any particular reason, so membarrier
should synchronize all possible kernel users as well.

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 1:44 PM, Alan Stern st...@rowland.harvard.edu wrote:

> On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote:
>> - On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu 
>> wrote:
>> 
>> >> > I agree with Nick: A memory barrier is needed somewhere between the
>> >> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
>> >> > with the Store Buffer pattern having a memory barrier on only one side,
>> >> > and it is well known that this arrangement does not guarantee any
>> >> > ordering.
>> >> 
>> >> Yes, I see this now. I'm still trying to wrap my head around why the 
>> >> memory
>> >> barrier at the end of membarrier() needs to be paired with a scheduler
>> >> barrier though.
>> > 
>> > The memory barrier at the end of membarrier() on CPU0 is necessary in
>> > order to enforce the guarantee that any writes occurring on CPU1 before
>> > the membarrier() is executed will be visible to any code executing on
>> > CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
>> > 
>> >    CPU0                    CPU1
>> >                            x = 1
>> >                            barrier()
>> >                            y = 1
>> >    r2 = y
>> >    membarrier():
>> >      a: smp_mb()
>> >      b: send IPI           IPI-induced mb
>> >      c: smp_mb()
>> >    r1 = x
>> > 
>> > The writes to x and y are unordered by the hardware, so it's possible to
>> > have r2 = 1 even though the write to x doesn't execute until b.  If the
>> > memory barrier at c is omitted then "r1 = x" can be reordered before b
>> > (although not before a), so we get r1 = 0.  This violates the guarantee
>> > that membarrier() is supposed to provide.
>> > 
>> > The timing of the memory barrier at c has to ensure that it executes
>> > after the IPI-induced memory barrier on CPU1.  If it happened before
>> > then we could still end up with r1 = 0.  That's why the pairing matters.
>> > 
>> > I hope this helps your head get properly wrapped.  :-)
>> 
>> It does help a bit! ;-)
>> 
>> This explains this part of the comment near the smp_mb at the end of 
>> membarrier:
>> 
>>  * Memory barrier on the caller thread _after_ we finished
>>  * waiting for the last IPI. [...]
>> 
>> However, it does not explain why it needs to be paired with a barrier in the
>> scheduler, clearly for the case where the IPI is skipped. I wonder whether 
>> this
>> part
>> of the comment is factually correct:
>> 
>>  * [...] Matches memory barriers around rq->curr modification in 
>> scheduler.
> 
> The reasoning is pretty much the same as above:
> 
>   CPU0                            CPU1
>                                   x = 1
>                                   barrier()
>                                   y = 1
>   r2 = y
>   membarrier():
>     a: smp_mb()
>                                   switch to kthread (includes mb)
>     b: read rq->curr == kthread
>                                   switch to user (includes mb)
>     c: smp_mb()
>   r1 = x
> 
> Once again, it is possible that x = 1 doesn't become visible to CPU0
> until shortly before b.  But if c is omitted then "r1 = x" can be
> reordered before b (to any time after a), so we can have r1 = 0.
> 
> Here the timing requirement is that c executes after the first memory
> barrier on CPU1 -- which is one of the ones around the rq->curr
> modification.  (In fact, in this scenario CPU1's switch back to the user
> process is irrelevant.)

That indeed covers the last scenario I was wondering about. Thanks Alan!

Mathieu

> 
> Alan Stern

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Alan Stern
On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote:
> - On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu 
> wrote:
> 
> >> > I agree with Nick: A memory barrier is needed somewhere between the
> >> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
> >> > with the Store Buffer pattern having a memory barrier on only one side,
> >> > and it is well known that this arrangement does not guarantee any
> >> > ordering.
> >> 
> >> Yes, I see this now. I'm still trying to wrap my head around why the memory
> >> barrier at the end of membarrier() needs to be paired with a scheduler
> >> barrier though.
> > 
> > The memory barrier at the end of membarrier() on CPU0 is necessary in
> > order to enforce the guarantee that any writes occurring on CPU1 before
> > the membarrier() is executed will be visible to any code executing on
> > CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
> > 
> > CPU0                    CPU1
> >                         x = 1
> >                         barrier()
> >                         y = 1
> > r2 = y
> > membarrier():
> >   a: smp_mb()
> >   b: send IPI           IPI-induced mb
> >   c: smp_mb()
> > r1 = x
> > 
> > The writes to x and y are unordered by the hardware, so it's possible to
> > have r2 = 1 even though the write to x doesn't execute until b.  If the
> > memory barrier at c is omitted then "r1 = x" can be reordered before b
> > (although not before a), so we get r1 = 0.  This violates the guarantee
> > that membarrier() is supposed to provide.
> > 
> > The timing of the memory barrier at c has to ensure that it executes
> > after the IPI-induced memory barrier on CPU1.  If it happened before
> > then we could still end up with r1 = 0.  That's why the pairing matters.
> > 
> > I hope this helps your head get properly wrapped.  :-)
> 
> It does help a bit! ;-)
> 
> This explains this part of the comment near the smp_mb at the end of 
> membarrier:
> 
>  * Memory barrier on the caller thread _after_ we finished
>  * waiting for the last IPI. [...]
> 
> However, it does not explain why it needs to be paired with a barrier in the
> scheduler, clearly for the case where the IPI is skipped. I wonder whether 
> this part
> of the comment is factually correct:
> 
>  * [...] Matches memory barriers around rq->curr modification in 
> scheduler.

The reasoning is pretty much the same as above:

        CPU0                            CPU1
                                        x = 1
                                        barrier()
                                        y = 1
        r2 = y
        membarrier():
          a: smp_mb()
                                        switch to kthread (includes mb)
          b: read rq->curr == kthread
                                        switch to user (includes mb)
          c: smp_mb()
        r1 = x

Once again, it is possible that x = 1 doesn't become visible to CPU0 
until shortly before b.  But if c is omitted then "r1 = x" can be 
reordered before b (to any time after a), so we can have r1 = 0.

Here the timing requirement is that c executes after the first memory 
barrier on CPU1 -- which is one of the ones around the rq->curr 
modification.  (In fact, in this scenario CPU1's switch back to the user 
process is irrelevant.)

Alan Stern


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu wrote:

>> > I agree with Nick: A memory barrier is needed somewhere between the
>> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
>> > with the Store Buffer pattern having a memory barrier on only one side,
>> > and it is well known that this arrangement does not guarantee any
>> > ordering.
>> 
>> Yes, I see this now. I'm still trying to wrap my head around why the memory
>> barrier at the end of membarrier() needs to be paired with a scheduler
>> barrier though.
> 
> The memory barrier at the end of membarrier() on CPU0 is necessary in
> order to enforce the guarantee that any writes occurring on CPU1 before
> the membarrier() is executed will be visible to any code executing on
> CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
> 
>   CPU0                    CPU1
>                           x = 1
>                           barrier()
>                           y = 1
>   r2 = y
>   membarrier():
>     a: smp_mb()
>     b: send IPI           IPI-induced mb
>     c: smp_mb()
>   r1 = x
> 
> The writes to x and y are unordered by the hardware, so it's possible to
> have r2 = 1 even though the write to x doesn't execute until b.  If the
> memory barrier at c is omitted then "r1 = x" can be reordered before b
> (although not before a), so we get r1 = 0.  This violates the guarantee
> that membarrier() is supposed to provide.
> 
> The timing of the memory barrier at c has to ensure that it executes
> after the IPI-induced memory barrier on CPU1.  If it happened before
> then we could still end up with r1 = 0.  That's why the pairing matters.
> 
> I hope this helps your head get properly wrapped.  :-)

It does help a bit! ;-)

This explains this part of the comment near the smp_mb at the end of membarrier:

 * Memory barrier on the caller thread _after_ we finished
 * waiting for the last IPI. [...]

However, it does not explain why it needs to be paired with a barrier in the
scheduler, clearly for the case where the IPI is skipped. I wonder whether this 
part
of the comment is factually correct:

 * [...] Matches memory barriers around rq->curr modification in 
scheduler.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Alan Stern
> > I agree with Nick: A memory barrier is needed somewhere between the
> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
> > with the Store Buffer pattern having a memory barrier on only one side,
> > and it is well known that this arrangement does not guarantee any
> > ordering.
> 
> Yes, I see this now. I'm still trying to wrap my head around why the memory
> barrier at the end of membarrier() needs to be paired with a scheduler
> barrier though.

The memory barrier at the end of membarrier() on CPU0 is necessary in 
order to enforce the guarantee that any writes occurring on CPU1 before 
the membarrier() is executed will be visible to any code executing on 
CPU0 after the membarrier().  Ignoring the kthread issue, we can have:

        CPU0                    CPU1
                                x = 1
                                barrier()
                                y = 1
        r2 = y
        membarrier():
          a: smp_mb()
          b: send IPI           IPI-induced mb
          c: smp_mb()
        r1 = x

The writes to x and y are unordered by the hardware, so it's possible to 
have r2 = 1 even though the write to x doesn't execute until b.  If the 
memory barrier at c is omitted then "r1 = x" can be reordered before b 
(although not before a), so we get r1 = 0.  This violates the guarantee 
that membarrier() is supposed to provide.

The timing of the memory barrier at c has to ensure that it executes 
after the IPI-induced memory barrier on CPU1.  If it happened before 
then we could still end up with r1 = 0.  That's why the pairing matters.

I hope this helps your head get properly wrapped.  :-)

Alan Stern


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 10:51 AM, Alan Stern st...@rowland.harvard.edu wrote:

> On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote:
>> - On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu 
>> wrote:
>> 
>> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
>> >> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
>> >> mathieu.desnoy...@efficios.com wrote:
>> >> 
>> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
>> >> > mathieu.desnoy...@efficios.com wrote:
>> >> > 
>> >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
>> >> >> wrote:
>> >> >>> I should be more complete here, especially since I was complaining
>> >> >>> about unclear barrier comment :)
>> >> >>> 
>> >> >>> 
>> >> >>> CPU0                     CPU1
>> >> >>> a. user stuff            1. user stuff
>> >> >>> b. membarrier()          2. enter kernel
>> >> >>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
>> >> >>> d. read rq->curr         4. rq->curr switched to kthread
>> >> >>> e. is kthread, skip IPI  5. switch_to kthread
>> >> >>> f. return to user        6. rq->curr switched to user thread
>> >> >>> g. user stuff            7. switch_to user thread
>> >> >>>                          8. exit kernel
>> >> >>>                          9. more user stuff
> 
> ...
> 
>> >> Requiring a memory barrier between update of rq->curr (back to current 
>> >> process's
>> >> thread) and following user-space memory accesses does not seem to 
>> >> guarantee
>> >> anything more than what the initial barrier at the beginning of __schedule
>> >> already
>> >> provides, because the guarantees are only about accesses to user-space 
>> >> memory.
> 
> ...
> 
>> > Is it correct to say that the switch_to operations in 5 and 7 include
>> > memory barriers?  If they do, then skipping the IPI should be okay.
>> > 
>> > The reason is as follows: The guarantee you need to enforce is that
>> > anything written by CPU0 before the membarrier() will be visible to CPU1
>> > after it returns to user mode.  Let's say that a writes to X and 9
>> > reads from X.
>> > 
>> > Then we have an instance of the Store Buffer pattern:
>> > 
>> >    CPU0                    CPU1
>> >    a. Write X              6. Write rq->curr for user thread
>> >    c. smp_mb()             7. switch_to memory barrier
>> >    d. Read rq->curr        9. Read X
>> > 
>> > In this pattern, the memory barriers make it impossible for both reads
>> > to miss their corresponding writes.  Since d does fail to read 6 (it
>> > sees the earlier value stored by 4), 9 must read a.
>> > 
>> > The other guarantee you need is that g on CPU0 will observe anything
>> > written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
>> > memory barrier and d reads from 4.
>> 
>> Right, and Nick's reply involving pairs of loads/stores on each side
>> clarifies the situation even further.
> 
> The key part of my reply was the question: "Is it correct to say that
> the switch_to operations in 5 and 7 include memory barriers?"  From the
> text quoted above and from Nick's reply, it seems clear that they do
> not.

I remember that switch_mm implies it, but not switch_to.

The scenario that triggered this discussion is when the scheduler does a
lazy tlb entry/exit, which is basically switch from a user task to
a kernel thread without changing the mm, and usually switching back afterwards.
This optimization means the rq->curr mm temporarily differs, which prevent
IPIs from being sent by membarrier, but without involving a switch_mm.
This requires explicit memory barriers either on entry/exit of lazy tlb
mode, or explicit barriers in the scheduler for those special-cases.

> I agree with Nick: A memory barrier is needed somewhere between the
> assignment at 6 and the return to user mode at 8.  Otherwise you end up
> with the Store Buffer pattern having a memory barrier on only one side,
> and it is well known that this arrangement does not guarantee any
> ordering.

Yes, I see this now. I'm still trying to wrap my head around why the memory
barrier at the end of membarrier() needs to be paired with a scheduler
barrier though.

> One thing I don't understand about all this: Any context switch has to
> include a memory barrier somewhere, but both you and Nick seem to be
> saying that steps 6 and 7 don't include (or don't need) any memory
> barriers.  What am I missing?

All context switches include the smp_mb__after_spinlock() at the beginning of
__schedule(), which I suspect is what you refer to. However, this barrier
is before the store to rq->curr, not after.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Alan Stern
On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote:
> - On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu wrote:
> 
> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
> >> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
> >> mathieu.desnoy...@efficios.com wrote:
> >> 
> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
> >> > mathieu.desnoy...@efficios.com wrote:
> >> > 
> >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
> >> >> wrote:
> >> >>> I should be more complete here, especially since I was complaining
> >> >>> about unclear barrier comment :)
> >> >>> 
> >> >>> 
> >> >>> CPU0                     CPU1
> >> >>> a. user stuff            1. user stuff
> >> >>> b. membarrier()          2. enter kernel
> >> >>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
> >> >>> d. read rq->curr         4. rq->curr switched to kthread
> >> >>> e. is kthread, skip IPI  5. switch_to kthread
> >> >>> f. return to user        6. rq->curr switched to user thread
> >> >>> g. user stuff            7. switch_to user thread
> >> >>>                          8. exit kernel
> >> >>>                          9. more user stuff

...

> >> Requiring a memory barrier between update of rq->curr (back to current 
> >> process's
> >> thread) and following user-space memory accesses does not seem to guarantee
> >> anything more than what the initial barrier at the beginning of __schedule
> >> already
> >> provides, because the guarantees are only about accesses to user-space 
> >> memory.

...

> > Is it correct to say that the switch_to operations in 5 and 7 include
> > memory barriers?  If they do, then skipping the IPI should be okay.
> > 
> > The reason is as follows: The guarantee you need to enforce is that
> > anything written by CPU0 before the membarrier() will be visible to CPU1
> > after it returns to user mode.  Let's say that a writes to X and 9
> > reads from X.
> > 
> > Then we have an instance of the Store Buffer pattern:
> > 
> > CPU0                    CPU1
> > a. Write X              6. Write rq->curr for user thread
> > c. smp_mb()             7. switch_to memory barrier
> > d. Read rq->curr        9. Read X
> > 
> > In this pattern, the memory barriers make it impossible for both reads
> > to miss their corresponding writes.  Since d does fail to read 6 (it
> > sees the earlier value stored by 4), 9 must read a.
> > 
> > The other guarantee you need is that g on CPU0 will observe anything
> > written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
> > memory barrier and d reads from 4.
> 
> Right, and Nick's reply involving pairs of loads/stores on each side
> clarifies the situation even further.

The key part of my reply was the question: "Is it correct to say that 
the switch_to operations in 5 and 7 include memory barriers?"  From the 
text quoted above and from Nick's reply, it seems clear that they do 
not.

I agree with Nick: A memory barrier is needed somewhere between the 
assignment at 6 and the return to user mode at 8.  Otherwise you end up 
with the Store Buffer pattern having a memory barrier on only one side, 
and it is well known that this arrangement does not guarantee any 
ordering.

One thing I don't understand about all this: Any context switch has to 
include a memory barrier somewhere, but both you and Nick seem to be 
saying that steps 6 and 7 don't include (or don't need) any memory 
barriers.  What am I missing?

Alan Stern


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote:
[...]
> 
> membarrier does replace barrier instructions on remote CPUs, which do
> order accesses performed by the kernel on the user address space. So
> membarrier should too I guess.
> 
> Normal process context accesses like read(2) will do so because they
> don't get filtered out from IPIs, but kernel threads using the mm may
> not.

But it should not be an issue, because membarrier's ordering is only with 
respect
to submit and completion of io_uring requests, which are performed through
system calls from the context of user-space threads, which are called from the
right mm.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu wrote:

> On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
>> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
>> > mathieu.desnoy...@efficios.com wrote:
>> > 
>> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
>> >> wrote:
>> >>> I should be more complete here, especially since I was complaining
>> >>> about unclear barrier comment :)
>> >>> 
>> >>> 
>> >>> CPU0                     CPU1
>> >>> a. user stuff            1. user stuff
>> >>> b. membarrier()          2. enter kernel
>> >>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
>> >>> d. read rq->curr         4. rq->curr switched to kthread
>> >>> e. is kthread, skip IPI  5. switch_to kthread
>> >>> f. return to user        6. rq->curr switched to user thread
>> >>> g. user stuff            7. switch_to user thread
>> >>>                          8. exit kernel
>> >>>                          9. more user stuff
>> >>> 
>> >>> What you're really ordering is a, g vs 1, 9 right?
>> >>> 
>> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>> >>> etc.
>> >>> 
>> >>> Userspace does not care where the barriers are exactly or what kernel
>> >>> memory accesses might be being ordered by them, so long as there is a
>> >>> mb somewhere between a and g, and 1 and 9. Right?
>> >> 
>> >> This is correct.
>> > 
>> > Actually, sorry, the above is not quite right. It's been a while
>> > since I looked into the details of membarrier.
>> > 
>> > The smp_mb() at the beginning of membarrier() needs to be paired with a
>> > smp_mb() _after_ rq->curr is switched back to the user thread, so the
>> > memory barrier is between store to rq->curr and following user-space
>> > accesses.
>> > 
>> > The smp_mb() at the end of membarrier() needs to be paired with the
>> > smp_mb__after_spinlock() at the beginning of schedule, which is
>> > between accesses to userspace memory and switching rq->curr to kthread.
>> > 
>> > As to *why* this ordering is needed, I'd have to dig through additional
>> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?
>> 
>> Thinking further about this, I'm beginning to consider that maybe we have 
>> been
>> overly cautious by requiring memory barriers before and after store to 
>> rq->curr.
>> 
>> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process
>> (current)
>> while running the membarrier system call, it necessarily means that CPU1 had
>> to issue smp_mb__after_spinlock when entering the scheduler, between any
>> user-space
>> loads/stores and update of rq->curr.
>> 
>> Requiring a memory barrier between update of rq->curr (back to current 
>> process's
>> thread) and following user-space memory accesses does not seem to guarantee
>> anything more than what the initial barrier at the beginning of __schedule
>> already
>> provides, because the guarantees are only about accesses to user-space 
>> memory.
>> 
>> Therefore, with the memory barrier at the beginning of __schedule, just
>> observing that
>> CPU1's rq->curr differs from current should guarantee that a memory barrier 
>> was
>> issued
>> between any sequentially consistent instructions belonging to the current
>> process on
>> CPU1.
>> 
>> Or am I missing/misremembering an important point here ?
> 
> Is it correct to say that the switch_to operations in 5 and 7 include
> memory barriers?  If they do, then skipping the IPI should be okay.
> 
> The reason is as follows: The guarantee you need to enforce is that
> anything written by CPU0 before the membarrier() will be visible to CPU1
> after it returns to user mode.  Let's say that a writes to X and 9
> reads from X.
> 
> Then we have an instance of the Store Buffer pattern:
> 
>   CPU0                    CPU1
>   a. Write X              6. Write rq->curr for user thread
>   c. smp_mb()             7. switch_to memory barrier
>   d. Read rq->curr        9. Read X
> 
> In this pattern, the memory barriers make it impossible for both reads
> to miss their corresponding writes.  Since d does fail to read 6 (it
> sees the earlier value stored by 4), 9 must read a.
> 
> The other guarantee you need is that g on CPU0 will observe anything
> written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
> memory barrier and d reads from 4.

Right, and Nick's reply involving pairs of loads/stores on each side
clarifies the situation even further.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Alan Stern
On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers 
> mathieu.desnoy...@efficios.com wrote:
> 
> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
> > mathieu.desnoy...@efficios.com wrote:
> > 
> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
> >> wrote:
> >>> I should be more complete here, especially since I was complaining
> >>> about unclear barrier comment :)
> >>> 
> >>> 
> >>> CPU0                     CPU1
> >>> a. user stuff            1. user stuff
> >>> b. membarrier()          2. enter kernel
> >>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
> >>> d. read rq->curr         4. rq->curr switched to kthread
> >>> e. is kthread, skip IPI  5. switch_to kthread
> >>> f. return to user        6. rq->curr switched to user thread
> >>> g. user stuff            7. switch_to user thread
> >>>                          8. exit kernel
> >>>                          9. more user stuff
> >>> 
> >>> What you're really ordering is a, g vs 1, 9 right?
> >>> 
> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
> >>> etc.
> >>> 
> >>> Userspace does not care where the barriers are exactly or what kernel
> >>> memory accesses might be being ordered by them, so long as there is a
> >>> mb somewhere between a and g, and 1 and 9. Right?
> >> 
> >> This is correct.
> > 
> > Actually, sorry, the above is not quite right. It's been a while
> > since I looked into the details of membarrier.
> > 
> > The smp_mb() at the beginning of membarrier() needs to be paired with a
> > smp_mb() _after_ rq->curr is switched back to the user thread, so the
> > memory barrier is between store to rq->curr and following user-space
> > accesses.
> > 
> > The smp_mb() at the end of membarrier() needs to be paired with the
> > smp_mb__after_spinlock() at the beginning of schedule, which is
> > between accesses to userspace memory and switching rq->curr to kthread.
> > 
> > As to *why* this ordering is needed, I'd have to dig through additional
> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?
> 
> Thinking further about this, I'm beginning to consider that maybe we have been
> overly cautious by requiring memory barriers before and after store to 
> rq->curr.
> 
> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process 
> (current)
> while running the membarrier system call, it necessarily means that CPU1 had
> to issue smp_mb__after_spinlock when entering the scheduler, between any 
> user-space
> loads/stores and update of rq->curr.
> 
> Requiring a memory barrier between update of rq->curr (back to current 
> process's
> thread) and following user-space memory accesses does not seem to guarantee
> anything more than what the initial barrier at the beginning of __schedule 
> already
> provides, because the guarantees are only about accesses to user-space memory.
> 
> Therefore, with the memory barrier at the beginning of __schedule, just 
> observing that
> CPU1's rq->curr differs from current should guarantee that a memory barrier 
> was issued
> between any sequentially consistent instructions belonging to the current 
> process on
> CPU1.
> 
> Or am I missing/misremembering an important point here ?

Is it correct to say that the switch_to operations in 5 and 7 include 
memory barriers?  If they do, then skipping the IPI should be okay.

The reason is as follows: The guarantee you need to enforce is that 
anything written by CPU0 before the membarrier() will be visible to CPU1 
after it returns to user mode.  Let's say that a writes to X and 9 
reads from X.

Then we have an instance of the Store Buffer pattern:

        CPU0                    CPU1
        a. Write X              6. Write rq->curr for user thread
        c. smp_mb()             7. switch_to memory barrier
        d. Read rq->curr        9. Read X

In this pattern, the memory barriers make it impossible for both reads 
to miss their corresponding writes.  Since d does fail to read 6 (it 
sees the earlier value stored by 4), 9 must read a.
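
For what it's worth, the same Store Buffer shape can be written as a user-space
C11 litmus harness (an illustration only, not kernel code; X and CURR here are
stand-ins for the user variable and rq->curr):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int X, CURR;
static int r0, r1;

static void *cpu0(void *arg)
{
        atomic_store_explicit(&X, 1, memory_order_relaxed);     /* a. Write X       */
        atomic_thread_fence(memory_order_seq_cst);              /* c. smp_mb()      */
        r0 = atomic_load_explicit(&CURR, memory_order_relaxed); /* d. Read rq->curr */
        return NULL;
}

static void *cpu1(void *arg)
{
        atomic_store_explicit(&CURR, 1, memory_order_relaxed);  /* 6. Write rq->curr */
        atomic_thread_fence(memory_order_seq_cst);              /* 7. switch_to mb   */
        r1 = atomic_load_explicit(&X, memory_order_relaxed);    /* 9. Read X         */
        return NULL;
}

int main(void)
{
        for (int i = 0; i < 100000; i++) {
                pthread_t t0, t1;

                atomic_store(&X, 0);
                atomic_store(&CURR, 0);
                pthread_create(&t0, NULL, cpu0, NULL);
                pthread_create(&t1, NULL, cpu1, NULL);
                pthread_join(t0, NULL);
                pthread_join(t1, NULL);
                if (r0 == 0 && r1 == 0) {
                        printf("both reads missed (iteration %d)\n", i);
                        return 1;
                }
        }
        printf("r0 == 0 && r1 == 0 never observed\n");
        return 0;
}

With a full fence on both sides the "both reads miss" outcome is forbidden;
drop either fence and the memory model (and store-buffering hardware) allows
it, which is exactly the gap being discussed when one side's barrier is skipped.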

The other guarantee you need is that g on CPU0 will observe anything 
written by CPU1 in 1.  This is easier to see, using the fact that 3 is a 
memory barrier and d reads from 4.

Alan Stern


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Nicholas Piggin
Excerpts from Mathieu Desnoyers's message of July 17, 2020 4:58 am:
> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers 
> mathieu.desnoy...@efficios.com wrote:
> 
>> - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>>> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote:
>>>> I should be more complete here, especially since I was complaining
>>>> about unclear barrier comment :)
>>>> 
>>>> 
>>>> CPU0                     CPU1
>>>> a. user stuff            1. user stuff
>>>> b. membarrier()          2. enter kernel
>>>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
>>>> d. read rq->curr         4. rq->curr switched to kthread
>>>> e. is kthread, skip IPI  5. switch_to kthread
>>>> f. return to user        6. rq->curr switched to user thread
>>>> g. user stuff            7. switch_to user thread
>>>>                          8. exit kernel
>>>>                          9. more user stuff
>>>> 
>>>> What you're really ordering is a, g vs 1, 9 right?
>>>> 
>>>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>>>> etc.
>>>> 
>>>> Userspace does not care where the barriers are exactly or what kernel
>>>> memory accesses might be being ordered by them, so long as there is a
>>>> mb somewhere between a and g, and 1 and 9. Right?
>>> 
>>> This is correct.
>> 
>> Actually, sorry, the above is not quite right. It's been a while
>> since I looked into the details of membarrier.
>> 
>> The smp_mb() at the beginning of membarrier() needs to be paired with a
>> smp_mb() _after_ rq->curr is switched back to the user thread, so the
>> memory barrier is between store to rq->curr and following user-space
>> accesses.
>> 
>> The smp_mb() at the end of membarrier() needs to be paired with the
>> smp_mb__after_spinlock() at the beginning of schedule, which is
>> between accesses to userspace memory and switching rq->curr to kthread.
>> 
>> As to *why* this ordering is needed, I'd have to dig through additional
>> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?
> 
> Thinking further about this, I'm beginning to consider that maybe we have been
> overly cautious by requiring memory barriers before and after store to 
> rq->curr.
> 
> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process 
> (current)
> while running the membarrier system call, it necessarily means that CPU1 had
> to issue smp_mb__after_spinlock when entering the scheduler, between any 
> user-space
> loads/stores and update of rq->curr.
> 
> Requiring a memory barrier between update of rq->curr (back to current 
> process's
> thread) and following user-space memory accesses does not seem to guarantee
> anything more than what the initial barrier at the beginning of __schedule 
> already
> provides, because the guarantees are only about accesses to user-space memory.
> 
> Therefore, with the memory barrier at the beginning of __schedule, just 
> observing that
> CPU1's rq->curr differs from current should guarantee that a memory barrier 
> was issued
> between any sequentially consistent instructions belonging to the current 
> process on
> CPU1.
> 
> Or am I missing/misremembering an important point here ?

I might have misled you.

 CPU0            CPU1
 r1=y            x=1
 membarrier()    y=1
 r2=x

membarrier provides if r1==1 then r2==1 (right?)

 CPU0
 r1=y
 membarrier()
   smp_mb();
   t = cpu_rq(1)->curr;
   if (t->mm == mm)
     IPI(CPU1);
   smp_mb()
 r2=x

 vs

 CPU1
   ...
   __schedule()
     smp_mb__after_spinlock()
     rq->curr = kthread
   ...
   __schedule()
     smp_mb__after_spinlock()
     rq->curr = user thread
   exit kernel
 x=1
 y=1

Now these last 3 stores are not ordered, so CPU0 might see y==1 but
rq->curr == kthread, right? Then it will skip the IPI and stores to x 
and y will not be ordered.

So we do need a mb after rq->curr store when mm is switching.

I believe for the global membarrier PF_KTHREAD optimisation, we also 
need a barrier when switching from a kernel thread to user, for the
same reason.

So I think I was wrong to say the barrier is not necessary.

I haven't quite worked out why two mb()s are required in membarrier(),
but at least that's less of a performance concern.
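
For reference, the r1/r2 guarantee in the example above can also be written
as a small user-space test. This is only a sketch of the intended semantics
(a single iteration, no looping or statistics), using the registered
private-expedited membarrier command and a compiler barrier on the writer
side:

    #include <pthread.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/membarrier.h>

    static volatile int x, y;
    static int r1, r2;

    static int membarrier(int cmd, unsigned int flags)
    {
        return syscall(__NR_membarrier, cmd, flags);
    }

    static void *writer(void *arg)          /* CPU1: x=1 then y=1 */
    {
        x = 1;
        asm volatile("" ::: "memory");      /* compiler barrier between the stores */
        y = 1;
        return NULL;
    }

    static void *reader(void *arg)          /* CPU0: r1=y, membarrier(), r2=x */
    {
        r1 = y;
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        r2 = x;
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
        pthread_create(&a, NULL, reader, NULL);
        pthread_create(&b, NULL, writer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Claimed guarantee: r1 == 1 implies r2 == 1. */
        return (r1 == 1 && r2 == 0) ? 1 : 0;
    }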

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Nicholas Piggin
Excerpts from pet...@infradead.org's message of July 16, 2020 9:00 pm:
> On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm:
>> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
>> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:
> 
>> >> But I’m wondering if all this deferred sync stuff is wrong. In the
>> >> brave new world of io_uring and such, perhaps kernel access matter
>> >> too.  Heck, even:
>> > 
>> > IIRC the membarrier SYNC_CORE use-case is about user-space
>> > self-modifying code.
>> > 
>> > Userspace re-uses a text address and needs to SYNC_CORE before it can be
>> > sure the old text is forgotten. Nothing the kernel does matters there.
>> > 
>> > I suppose the manpage could be more clear there.
>> 
>> True, but memory ordering of kernel stores from kernel threads for
>> regular mem barrier is the concern here.
>> 
>> Does io_uring update completion queue from kernel thread or interrupt,
>> for example? If it does, then membarrier will not order such stores
>> with user memory accesses.
> 
> So we're talking about regular membarrier() then? Not the SYNC_CORE
> variant per-se.

Well, both but Andy in this case was wondering about kernel writes
vs user.

> 
> Even there, I'll argue we don't care, but perhaps Mathieu has a
> different opinion. All we care about is that all other threads (or CPUs
> for GLOBAL) observe an smp_mb() before it returns.
> 
> Any serialization against whatever those other threads/CPUs are running
> at the instant of the syscall is external to the syscall, we make no
> guarantees about that. That is, we can fundamentally not say what
> another CPU is executing concurrently. Nor should we want to.
> 
> So if you feel that your membarrier() ought to serialize against remote
> execution, you need to arrange a quiescent state on the remote side
> yourself.
> 
> Now, normally membarrier() is used to implement userspace RCU like
> things, and there all that matters is that the remote CPUs observe the
> beginning of the new grace-period, ie counter flip, and we observe their
> read-side critical sections, or something like that, it's been a while
> since I looked at all that.
> 
> It's always been the case that concurrent syscalls could change user
> memory, io_uring doesn't change that, it just makes it even less well
> defined when that would happen. If you want to serialize against that,
> you need to arrange that externally.

membarrier does replace barrier instructions on remote CPUs, and those
barriers do order accesses performed by the kernel on the user address
space. So membarrier should too, I guess.

Normal process context accesses like read(2) will do so because they
don't get filtered out from IPIs, but kernel threads using the mm may
not.

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
> mathieu.desnoy...@efficios.com wrote:
> 
>> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote:
>>> I should be more complete here, especially since I was complaining
>>> about unclear barrier comment :)
>>> 
>>> 
>>> CPU0                     CPU1
>>> a. user stuff            1. user stuff
>>> b. membarrier()          2. enter kernel
>>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
>>> d. read rq->curr         4. rq->curr switched to kthread
>>> e. is kthread, skip IPI  5. switch_to kthread
>>> f. return to user        6. rq->curr switched to user thread
>>> g. user stuff            7. switch_to user thread
>>>                          8. exit kernel
>>>                          9. more user stuff
>>> 
>>> What you're really ordering is a, g vs 1, 9 right?
>>> 
>>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>>> etc.
>>> 
>>> Userspace does not care where the barriers are exactly or what kernel
>>> memory accesses might be being ordered by them, so long as there is a
>>> mb somewhere between a and g, and 1 and 9. Right?
>> 
>> This is correct.
> 
> Actually, sorry, the above is not quite right. It's been a while
> since I looked into the details of membarrier.
> 
> The smp_mb() at the beginning of membarrier() needs to be paired with a
> smp_mb() _after_ rq->curr is switched back to the user thread, so the
> memory barrier is between store to rq->curr and following user-space
> accesses.
> 
> The smp_mb() at the end of membarrier() needs to be paired with the
> smp_mb__after_spinlock() at the beginning of schedule, which is
> between accesses to userspace memory and switching rq->curr to kthread.
> 
> As to *why* this ordering is needed, I'd have to dig through additional
> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?

Thinking further about this, I'm beginning to consider that maybe we have been
overly cautious by requiring memory barriers before and after store to rq->curr.

If CPU0 observes a CPU1's rq->curr->mm which differs from its own process 
(current)
while running the membarrier system call, it necessarily means that CPU1 had
to issue smp_mb__after_spinlock when entering the scheduler, between any 
user-space
loads/stores and update of rq->curr.

Requiring a memory barrier between update of rq->curr (back to current process's
thread) and following user-space memory accesses does not seem to guarantee
anything more than what the initial barrier at the beginning of __schedule 
already
provides, because the guarantees are only about accesses to user-space memory.

Therefore, with the memory barrier at the beginning of __schedule, just 
observing that
CPU1's rq->curr differs from current should guarantee that a memory barrier was 
issued
between any sequentially consistent instructions belonging to the current 
process on
CPU1.

Or am I missing/misremembering an important point here ?

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> 
>> Note that the accesses to user-space memory can be
>> done either by user-space code or kernel code, it doesn't matter.
>> However, in order to be considered as happening before/after
>> either membarrier or the matching compiler barrier, kernel code
>> needs to have causality relationship with user-space execution,
>> e.g. user-space does a system call, or returns from a system call.
>> 
>> In the case of io_uring, submitting a request or returning from waiting
>> on request completion appear to provide this causality relationship.
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote:
>> I should be more complete here, especially since I was complaining
>> about unclear barrier comment :)
>> 
>> 
>> CPU0                     CPU1
>> a. user stuff            1. user stuff
>> b. membarrier()          2. enter kernel
>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
>> d. read rq->curr         4. rq->curr switched to kthread
>> e. is kthread, skip IPI  5. switch_to kthread
>> f. return to user        6. rq->curr switched to user thread
>> g. user stuff            7. switch_to user thread
>>                          8. exit kernel
>>                          9. more user stuff
>> 
>> What you're really ordering is a, g vs 1, 9 right?
>> 
>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>> etc.
>> 
>> Userspace does not care where the barriers are exactly or what kernel
>> memory accesses might be being ordered by them, so long as there is a
>> mb somewhere between a and g, and 1 and 9. Right?
> 
> This is correct.

Actually, sorry, the above is not quite right. It's been a while
since I looked into the details of membarrier.

The smp_mb() at the beginning of membarrier() needs to be paired with a
smp_mb() _after_ rq->curr is switched back to the user thread, so the
memory barrier is between store to rq->curr and following user-space
accesses.

The smp_mb() at the end of membarrier() needs to be paired with the
smp_mb__after_spinlock() at the beginning of schedule, which is
between accesses to userspace memory and switching rq->curr to kthread.

As to *why* this ordering is needed, I'd have to dig through additional
scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?

Thanks,

Mathieu


> Note that the accesses to user-space memory can be
> done either by user-space code or kernel code, it doesn't matter.
> However, in order to be considered as happening before/after
> either membarrier or the matching compiler barrier, kernel code
> needs to have causality relationship with user-space execution,
> e.g. user-space does a system call, or returns from a system call.
> 
> In the case of io_uring, submitting a request or returning from waiting
> on request completion appear to provide this causality relationship.
> 
> Thanks,
> 
> Mathieu
> 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers



- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote:
> I should be more complete here, especially since I was complaining
> about unclear barrier comment :)
> 
> 
> CPU0                     CPU1
> a. user stuff            1. user stuff
> b. membarrier()          2. enter kernel
> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
> d. read rq->curr         4. rq->curr switched to kthread
> e. is kthread, skip IPI  5. switch_to kthread
> f. return to user        6. rq->curr switched to user thread
> g. user stuff            7. switch_to user thread
>                          8. exit kernel
>                          9. more user stuff
> 
> What you're really ordering is a, g vs 1, 9 right?
> 
> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
> etc.
> 
> Userspace does not care where the barriers are exactly or what kernel
> memory accesses might be being ordered by them, so long as there is a
> mb somewhere between a and g, and 1 and 9. Right?

This is correct. Note that the accesses to user-space memory can be
done either by user-space code or kernel code, it doesn't matter.
However, in order to be considered as happening before/after
either membarrier or the matching compiler barrier, kernel code
needs to have causality relationship with user-space execution,
e.g. user-space does a system call, or returns from a system call.

In the case of io_uring, submitting a request or returning from waiting
on request completion appear to provide this causality relationship.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 7:00 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm:
>> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
>> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:
> 
>> >> But I’m wondering if all this deferred sync stuff is wrong. In the
>> >> brave new world of io_uring and such, perhaps kernel access matter
>> >> too.  Heck, even:
>> > 
>> > IIRC the membarrier SYNC_CORE use-case is about user-space
>> > self-modifying code.
>> > 
>> > Userspace re-uses a text address and needs to SYNC_CORE before it can be
>> > sure the old text is forgotten. Nothing the kernel does matters there.
>> > 
>> > I suppose the manpage could be more clear there.
>> 
>> True, but memory ordering of kernel stores from kernel threads for
>> regular mem barrier is the concern here.
>> 
>> Does io_uring update completion queue from kernel thread or interrupt,
>> for example? If it does, then membarrier will not order such stores
>> with user memory accesses.
> 
> So we're talking about regular membarrier() then? Not the SYNC_CORE
> variant per-se.
> 
> Even there, I'll argue we don't care, but perhaps Mathieu has a
> different opinion.

I agree with Peter that we don't care about accesses to user-space
memory performed concurrently with membarrier.

What we'd care about in terms of accesses to user-space memory from the
kernel is something that would be clearly ordered as happening before
or after the membarrier call, for instance a read(2) followed by
membarrier(2) after the read returns, or a read(2) issued after return
from membarrier(2). The other scenario we'd care about is with the compiler
barrier paired with membarrier: e.g. read(2) returns, compiler barrier,
followed by a store. Or load, compiler barrier, followed by write(2).

All those scenarios imply before/after ordering wrt either membarrier or
the compiler barrier. I notice that io_uring has a "completion" queue.
Let's try to come up with realistic usage scenarios.

So the dependency chain would be provided by e.g.:

* Infrequent read / Frequent write, communicating read completion through 
variable X

wait for io_uring read request completion -> membarrier -> store X=1

with matching

load from X (waiting for X==1) -> asm volatile (::: "memory") -> submit 
io_uring write request

or this other scenario:

* Frequent read / Infrequent write, communicating read completion through 
variable X

load from X (waiting for X==1) -> membarrier -> submit io_uring write request

with matching

wait for io_uring read request completion -> asm volatile (::: "memory") -> 
store X=1
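
A sketch of the first pairing in code (only a sketch: submit_read_and_wait()
and submit_write() are hypothetical stand-ins for however the application
drives its io_uring, and the process is assumed to have already registered
private-expedited membarrier):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/membarrier.h>

    /* Hypothetical io_uring wrappers, not a real API. */
    void submit_read_and_wait(void);
    void submit_write(void);

    static volatile int X;      /* communicates read completion between threads */

    /* Infrequent read / frequent write: reader side. */
    static void reader_side(void)
    {
        submit_read_and_wait();                        /* io_uring read completes */
        syscall(__NR_membarrier,
                MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);  /* membarrier */
        X = 1;                                         /* store X=1 */
    }

    /* Writer side: matched with a compiler barrier only. */
    static void writer_side(void)
    {
        while (X != 1)                                 /* load from X, wait for X==1 */
            ;
        asm volatile("" ::: "memory");                 /* asm volatile (::: "memory") */
        submit_write();                                /* submit io_uring write */
    }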

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread peterz
On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote:
> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm:
> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:

> >> But I’m wondering if all this deferred sync stuff is wrong. In the
> >> brave new world of io_uring and such, perhaps kernel access matter
> >> too.  Heck, even:
> > 
> > IIRC the membarrier SYNC_CORE use-case is about user-space
> > self-modifying code.
> > 
> > Userspace re-uses a text address and needs to SYNC_CORE before it can be
> > sure the old text is forgotten. Nothing the kernel does matters there.
> > 
> > I suppose the manpage could be more clear there.
> 
> True, but memory ordering of kernel stores from kernel threads for
> regular mem barrier is the concern here.
> 
> Does io_uring update completion queue from kernel thread or interrupt,
> for example? If it does, then membarrier will not order such stores
> with user memory accesses.

So we're talking about regular membarrier() then? Not the SYNC_CORE
variant per-se.

Even there, I'll argue we don't care, but perhaps Mathieu has a
different opinion. All we care about is that all other threads (or CPUs
for GLOBAL) observe an smp_mb() before it returns.

Any serialization against whatever those other threads/CPUs are running
at the instant of the syscall is external to the syscall, we make no
guarantees about that. That is, we can fundamentally not say what
another CPU is executing concurrently. Nor should we want to.

So if you feel that your membarrier() ought to serialize against remote
execution, you need to arrange a quiescent state on the remote side
yourself.

Now, normally membarrier() is used to implement userspace RCU like
things, and there all that matters is that the remote CPUs observe the
beginning of the new grace-period, ie counter flip, and we observe their
read-side critical sections, or something like that, it's been a while
since I looked at all that.
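
As a rough illustration of that usage (a sketch only, far from a complete or
careful userspace RCU; liburcu is the real thing), the grace-period side
flips a counter and then relies on membarrier() so the readers can get away
with nothing more than a compiler barrier:

    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/membarrier.h>

    #define NR_READERS 64                    /* arbitrary for the sketch */

    static atomic_int gp_counter;            /* the grace-period "flip" */
    static atomic_int reader_gp[NR_READERS]; /* what each reader last saw */

    /* Reader fast path: no fence instructions, only compiler ordering. */
    static void rcu_read_begin(int id)
    {
        int gp = atomic_load_explicit(&gp_counter, memory_order_relaxed);

        atomic_store_explicit(&reader_gp[id], gp, memory_order_relaxed);
        asm volatile("" ::: "memory");       /* pairs with membarrier() below */
    }

    /* Writer slow path: flip, then force a barrier on all running threads
     * of the process before scanning reader state (scan loop elided). */
    static void synchronize(void)
    {
        atomic_fetch_add_explicit(&gp_counter, 1, memory_order_relaxed);
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        /* ... wait here until every reader_gp[] has observed the new value ... */
    }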

It's always been the case that concurrent syscalls could change user
memory, io_uring doesn't change that, it just makes it even less well
defined when that would happen. If you want to serialize against that,
you need to arrange that externally.


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm:
> On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
>> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:
> 
>> > CPU0                     CPU1
>> >                          1. user stuff
>> > a. membarrier()          2. enter kernel
>> > b. read rq->curr         3. rq->curr switched to kthread
>> > c. is kthread, skip IPI  4. switch_to kthread
>> > d. return to user        5. rq->curr switched to user thread
>> >                          6. switch_to user thread
>> >                          7. exit kernel
>> >                          8. more user stuff
> 
>> I find it hard to believe that this is x86 only. Why would thread
>> switch imply core sync on any architecture?  Is x86 unique in having a
>> stupid expensive core sync that is heavier than smp_mb()?
> 
> smp_mb() is nowhere near the most expensive barrier we have in Linux,
> mb() might qualify, since that has some completion requirements since it
> needs to serialize against external actors.
> 
> On x86_64 things are rather murky, we have:
> 
>   LOCK prefix -- which implies smp_mb() before and after RmW
>   LFENCE -- which used to be rmb like, until Spectre, and now it
> is ISYNC like. Since ISYNC ensures an empty pipeline,
> it also implies all loads are retired (and therefore
> complete) it implies rmb.
>   MFENCE -- which is a memop completion barrier like, it makes
> sure all previously issued memops are complete.
> 
> if you read that carefully, you'll note you'll have to use LFENCE +
> MFENCE to order against non-memops instructions.
> 
> But none of them imply dumping the instruction decoder caches, that only
> happens on core serializing instructions like CR3 writes, IRET, CPUID
> and a few others, I think we recently got a SERIALIZE instruction to add
> to this list.
> 
> 
> On ARM64 there's a whole different set of barriers, and again
> smp_mb() isn't anywhere near the top of the list. They have roughly 3
> classes:
> 
>   ISB -- instruction sync barrier
>   DMB(x) -- memory ordering in domain x
>   DSB(x) -- memory completion in domain x
> 
> And they have at least 3 domains (IIRC), system, outer, inner.
> 
> The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to
> have a SYNC, but since PowerPC is rare in having only one ridiculously
> heavy serializing instruction, we got to re-use the smp_mb() early in
> __schedule() instead, but ARM64 can't do that.
> 
> 
> So rather than say that x86 is special here, I'd say that PowerPC is
> special here.

PowerPC is "special", I'll agree with you there :)

It does have a SYNC (HWSYNC) instruction that is mb(). It does not
serialize the core.

ISYNC is a nop. ICBI ; ISYNC does serialize the core.

Difference between them is probably much the same as difference between
MFENCE and CPUID on x86 CPUs. Serializing the core is almost always 
pretty expensive. HWSYNC/MFENCE can be expensive if you have a lot of
outstanding stores, or difficult (not exclusive in cache) ones, with critical
reads after the barrier, but it can also be somewhat cheap if there are few
writes; once it has executed past, it only needs to hold up subsequent reads.

That said... implementation details. powerpc CPUs have traditionally
had fairly costly HWSYNC.
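
For concreteness, the two flavours being contrasted, written as user-space
helpers for a powerpc64 compiler (a sketch that follows the text above;
actual costs are implementation specific):

    /* Full memory barrier: hwsync, i.e. the plain "sync" mnemonic (mb()). */
    static inline void ppc_mb(void)
    {
        asm volatile("sync" ::: "memory");
    }

    /* Core-serializing sequence as described above: icbi followed by isync. */
    static inline void ppc_core_sync(const void *addr)
    {
        asm volatile("icbi 0,%0; isync" : : "r"(addr) : "memory");
    }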


>> But I’m wondering if all this deferred sync stuff is wrong. In the
>> brave new world of io_uring and such, perhaps kernel access matter
>> too.  Heck, even:
> 
> IIRC the membarrier SYNC_CORE use-case is about user-space
> self-modifying code.
> 
> Userspace re-uses a text address and needs to SYNC_CORE before it can be
> sure the old text is forgotten. Nothing the kernel does matters there.
> 
> I suppose the manpage could be more clear there.

True, but memory ordering of kernel stores from kernel threads for
regular mem barrier is the concern here.

Does io_uring update completion queue from kernel thread or interrupt,
for example? If it does, then membarrier will not order such stores
with user memory accesses.

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Peter Zijlstra
On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:

> > CPU0                     CPU1
> >                          1. user stuff
> > a. membarrier()          2. enter kernel
> > b. read rq->curr         3. rq->curr switched to kthread
> > c. is kthread, skip IPI  4. switch_to kthread
> > d. return to user        5. rq->curr switched to user thread
> >                          6. switch_to user thread
> >                          7. exit kernel
> >                          8. more user stuff

> I find it hard to believe that this is x86 only. Why would thread
> switch imply core sync on any architecture?  Is x86 unique in having a
> stupid expensive core sync that is heavier than smp_mb()?

smp_mb() is nowhere near the most expensive barrier we have in Linux,
mb() might qualify, since that has some completion requirements since it
needs to serialize against external actors.

On x86_64 things are rather murky, we have:

LOCK prefix -- which implies smp_mb() before and after RmW
LFENCE -- which used to be rmb like, until Spectre, and now it
  is ISYNC like. Since ISYNC ensures an empty pipeline,
  it also implies all loads are retired (and therefore
  complete) it implies rmb.
MFENCE -- which is a memop completion barrier like, it makes
  sure all previously issued memops are complete.

if you read that carefully, you'll note you'll have to use LFENCE +
MFENCE to order against non-memops instructions.

But none of them imply dumping the instruction decoder caches, that only
happens on core serializing instructions like CR3 writes, IRET, CPUID
and a few others, I think we recently got a SERIALIZE instruction to add
to this list.
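
As an aside, a user-space sketch of forcing such core serialization with one
of the instructions listed above (CPUID, since it is unprivileged and has
long been documented as serializing; this snippet is only an illustration):

    /* Execute CPUID purely for its architecturally serializing side effect. */
    static inline void core_serialize(void)
    {
        unsigned int eax = 0, ebx, ecx, edx;

        asm volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     :
                     : "memory");
    }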


On ARM64 there's a whole different set of barriers, and again
smp_mb() isn't anywhere near the top of the list. They have roughly 3
classes:

ISB -- instruction sync barrier
DMB(x) -- memory ordering in domain x
DSB(x) -- memory completion in domain x

And they have at least 3 domains (IIRC), system, outer, inner.

The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to
have a SYNC, but since PowerPC is rare in having only one ridiculously
heavy serializing instruction, we got to re-use the smp_mb() early in
__schedule() instead, but ARM64 can't do that.


So rather than say that x86 is special here, I'd say that PowerPC is
special here.

> But I’m wondering if all this deferred sync stuff is wrong. In the
> brave new world of io_uring and such, perhaps kernel access matter
> too.  Heck, even:

IIRC the membarrier SYNC_CORE use-case is about user-space
self-modifying code.

Userspace re-uses a text address and needs to SYNC_CORE before it can be
sure the old text is forgotten. Nothing the kernel does matters there.

I suppose the manpage could be more clear there.



Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of July 16, 2020 3:18 pm:
> 
> 
>> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:
>> 
>> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
>>> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
>>> 
>>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>>>>> serialize.  There are older kernels for which it does not promise to
>>>>>> serialize.  And I have plans to make it stop serializing in the
>>>>>> nearish future.
>>>>> 
>>>>> You mean x86's return from interrupt? Sounds fun... you'll know where to
>>>>> update the membarrier sync code, at least :)
>>>> 
>>>> Oh, I should actually say Mathieu recently clarified a return from
>>>> interrupt doesn't fundamentally need to serialize in order to support
>>>> membarrier sync core.
>>> 
>>> Clarification to your statement:
>>> 
>>> Return from interrupt to kernel code does not need to be context serializing
>>> as long as kernel serializes before returning to user-space.
>>> 
>>> However, return from interrupt to user-space needs to be context 
>>> serializing.
>> 
>> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb
>> in the right places.
>> 
>> A kernel thread does a use_mm, then it blocks and the user process with
>> the same mm runs on that CPU, and then it calls into the kernel, blocks,
>> the kernel thread runs again, another CPU issues a membarrier which does
>> not IPI this one because it's running a kthread, and then the kthread
>> switches back to the user process (still without having unused the mm),
>> and then the user process returns from syscall without having done a 
>> core synchronising instruction.
>> 
>> The cause of the problem is you want to avoid IPI'ing kthreads. Why?
>> I'm guessing it really only matters as an optimisation in case of idle
>> threads. Idle thread is easy (well, easier) because it won't use_mm, so 
>> you could check for rq->curr == rq->idle in your loop (in a suitable 
>> sched accessor function).
>> 
>> But... I'm not really liking this subtlety in the scheduler for all this 
>> (the scheduler still needs the barriers when switching out of idle).
>> 
>> Can it be improved somehow? Let me forget x86 core sync problem for now
>> (that _may_ be a bit harder), and step back and look at what we're doing.
>> The memory barrier case would actually suffer from the same problem as
>> core sync, because in the same situation it has no implicit mmdrop in
>> the scheduler switch code either.
>> 
>> So what are we doing with membarrier? We want any activity caused by the 
>> set of CPUs/threads specified that can be observed by this thread before 
>> calling membarrier is appropriately fenced from activity that can be 
>> observed to happen after the call returns.
>> 
>> CPU0                     CPU1
>>                          1. user stuff
>> a. membarrier()          2. enter kernel
>> b. read rq->curr         3. rq->curr switched to kthread
>> c. is kthread, skip IPI  4. switch_to kthread
>> d. return to user        5. rq->curr switched to user thread
>>                          6. switch_to user thread
>>                          7. exit kernel
>>                          8. more user stuff
>> 
>> As far as I can see, the problem is CPU1 might reorder step 5 and step
>> 8, so you have mmdrop of lazy mm be a mb after step 6.
>> 
>> But why? The membarrier call only cares that there is a full barrier
>> between 1 and 8, right? Which it will get from the previous context
>> switch to the kthread.
>> 
>> I must say the memory barrier comments in membarrier could be improved
>> a bit (unless I'm missing where the main comment is). It's fine to know
>> what barriers pair with one another, but we need to know which exact
>> memory accesses it is ordering
>> 
>>   /*
>> * Matches memory barriers around rq->curr modification in
>> * scheduler.
>> */
>> 
>> Sure, but it doesn't say what else is being ordered. I think it's just
>> the user memory accesses, but would be nice to make that a bit more
>> explicit. If we had such comments then we might know this case is safe.
>> 
>> I think the funny powerpc barrier is a similar case of this. If we
>> ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that
>> CPU has or will have issued a memory barrier between running user
>> code.
>> 
>> So AFAIKS all this membarrier stuff in kernel/sched/core.c could
>> just go away. Except x86 because thread switch doesn't imply core
>> sync, so CPU1 between 1 and 8 may never issue a core sync instruction
>> the same way a context switch must be a full mb.
>> 
>> Before getting to x86 -- Am I right, or way off track here?
> 
> I find it hard to believe that this is x86 only. Why would thread switch 
> imply core sync on any architecture?  Is x86 unique in having a stupid
> expensive core sync that is heavier than smp_mb()?

Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-15 Thread Andy Lutomirski



> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
>> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
>> 
>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>>>> serialize.  There are older kernels for which it does not promise to
>>>>> serialize.  And I have plans to make it stop serializing in the
>>>>> nearish future.
>>>> 
>>>> You mean x86's return from interrupt? Sounds fun... you'll know where to
>>>> update the membarrier sync code, at least :)
>>> 
>>> Oh, I should actually say Mathieu recently clarified a return from
>>> interrupt doesn't fundamentally need to serialize in order to support
>>> membarrier sync core.
>> 
>> Clarification to your statement:
>> 
>> Return from interrupt to kernel code does not need to be context serializing
>> as long as kernel serializes before returning to user-space.
>> 
>> However, return from interrupt to user-space needs to be context serializing.
> 
> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb
> in the right places.
> 
> A kernel thread does a use_mm, then it blocks and the user process with
> the same mm runs on that CPU, and then it calls into the kernel, blocks,
> the kernel thread runs again, another CPU issues a membarrier which does
> not IPI this one because it's running a kthread, and then the kthread
> switches back to the user process (still without having unused the mm),
> and then the user process returns from syscall without having done a 
> core synchronising instruction.
> 
> The cause of the problem is you want to avoid IPI'ing kthreads. Why?
> I'm guessing it really only matters as an optimisation in case of idle
> threads. Idle thread is easy (well, easier) because it won't use_mm, so 
> you could check for rq->curr == rq->idle in your loop (in a suitable 
> sched accessor function).
> 
> But... I'm not really liking this subtlety in the scheduler for all this 
> (the scheduler still needs the barriers when switching out of idle).
> 
> Can it be improved somehow? Let me forget x86 core sync problem for now
> (that _may_ be a bit harder), and step back and look at what we're doing.
> The memory barrier case would actually suffer from the same problem as
> core sync, because in the same situation it has no implicit mmdrop in
> the scheduler switch code either.
> 
> So what are we doing with membarrier? We want any activity caused by the 
> set of CPUs/threads specified that can be observed by this thread before 
> calling membarrier is appropriately fenced from activity that can be 
> observed to happen after the call returns.
> 
> CPU0                     CPU1
>                          1. user stuff
> a. membarrier()          2. enter kernel
> b. read rq->curr         3. rq->curr switched to kthread
> c. is kthread, skip IPI  4. switch_to kthread
> d. return to user        5. rq->curr switched to user thread
>                          6. switch_to user thread
>                          7. exit kernel
>                          8. more user stuff
> 
> As far as I can see, the problem is CPU1 might reorder step 5 and step
> 8, so you have mmdrop of lazy mm be a mb after step 6.
> 
> But why? The membarrier call only cares that there is a full barrier
> between 1 and 8, right? Which it will get from the previous context
> switch to the kthread.
> 
> I must say the memory barrier comments in membarrier could be improved
> a bit (unless I'm missing where the main comment is). It's fine to know
> what barriers pair with one another, but we need to know which exact
> memory accesses it is ordering
> 
>   /*
> * Matches memory barriers around rq->curr modification in
> * scheduler.
> */
> 
> Sure, but it doesn't say what else is being ordered. I think it's just
> the user memory accesses, but would be nice to make that a bit more
> explicit. If we had such comments then we might know this case is safe.
> 
> I think the funny powerpc barrier is a similar case of this. If we
> ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that
> CPU has or will have issued a memory barrier between running user
> code.
> 
> So AFAIKS all this membarrier stuff in kernel/sched/core.c could
> just go away. Except x86 because thread switch doesn't imply core
> sync, so CPU1 between 1 and 8 may never issue a core sync instruction
> the same way a context switch must be a full mb.
> 
> Before getting to x86 -- Am I right, or way off track here?

I find it hard to believe that this is x86 only. Why would thread switch imply 
core sync on any architecture?  Is x86 unique in having a stupid expensive core 
sync that is heavier than smp_mb()?

But I’m wondering if all this deferred sync stuff is wrong. In the brave new 
world of io_uring and such, perhaps kernel accesses matter too.

Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-15 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of July 16, 2020 2:15 pm:
> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
>> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
>> 
>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>>>> serialize.  There are older kernels for which it does not promise to
>>>>> serialize.  And I have plans to make it stop serializing in the
>>>>> nearish future.
>>>> 
>>>> You mean x86's return from interrupt? Sounds fun... you'll know where to
>>>> update the membarrier sync code, at least :)
>>> 
>>> Oh, I should actually say Mathieu recently clarified a return from
>>> interrupt doesn't fundamentally need to serialize in order to support
>>> membarrier sync core.
>> 
>> Clarification to your statement:
>> 
>> Return from interrupt to kernel code does not need to be context serializing
>> as long as kernel serializes before returning to user-space.
>> 
>> However, return from interrupt to user-space needs to be context serializing.
> 
> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb
> in the right places.
> 
> A kernel thread does a use_mm, then it blocks and the user process with
> the same mm runs on that CPU, and then it calls into the kernel, blocks,
> the kernel thread runs again, another CPU issues a membarrier which does
> not IPI this one because it's running a kthread, and then the kthread
> switches back to the user process (still without having unused the mm),
> and then the user process returns from syscall without having done a 
> core synchronising instruction.
> 
> The cause of the problem is you want to avoid IPI'ing kthreads. Why?
> I'm guessing it really only matters as an optimisation in case of idle
> threads. Idle thread is easy (well, easier) because it won't use_mm, so 
> you could check for rq->curr == rq->idle in your loop (in a suitable 
> sched accessor function).
> 
> But... I'm not really liking this subtlety in the scheduler for all this 
> (the scheduler still needs the barriers when switching out of idle).
> 
> Can it be improved somehow? Let me forget x86 core sync problem for now
> (that _may_ be a bit harder), and step back and look at what we're doing.
> The memory barrier case would actually suffer from the same problem as
> core sync, because in the same situation it has no implicit mmdrop in
> the scheduler switch code either.
> 
> So what are we doing with membarrier? We want any activity caused by the 
> set of CPUs/threads specified that can be observed by this thread before 
> calling membarrier is appropriately fenced from activity that can be 
> observed to happen after the call returns.
> 
> CPU0                     CPU1
>                          1. user stuff
> a. membarrier()          2. enter kernel
> b. read rq->curr         3. rq->curr switched to kthread
> c. is kthread, skip IPI  4. switch_to kthread
> d. return to user        5. rq->curr switched to user thread
>                          6. switch_to user thread
>                          7. exit kernel
>                          8. more user stuff
> 
> As far as I can see, the problem is CPU1 might reorder step 5 and step
> 8, so you have mmdrop of lazy mm be a mb after step 6.
> 
> But why? The membarrier call only cares that there is a full barrier
> between 1 and 8, right? Which it will get from the previous context
> switch to the kthread.

I should be more complete here, especially since I was complaining
about unclear barrier comment :)


CPU0                     CPU1
a. user stuff            1. user stuff
b. membarrier()          2. enter kernel
c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
d. read rq->curr         4. rq->curr switched to kthread
e. is kthread, skip IPI  5. switch_to kthread
f. return to user        6. rq->curr switched to user thread
g. user stuff            7. switch_to user thread
                         8. exit kernel
                         9. more user stuff

What you're really ordering is a, g vs 1, 9 right?

In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
etc.

Userspace does not care where the barriers are exactly or what kernel 
memory accesses might be being ordered by them, so long as there is a
mb somewhere between a and g, and 1 and 9. Right?


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-15 Thread Nicholas Piggin
Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
> 
>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>>> serialize.  There are older kernels for which it does not promise to
>>>> serialize.  And I have plans to make it stop serializing in the
>>>> nearish future.
>>> 
>>> You mean x86's return from interrupt? Sounds fun... you'll know where to
>>> update the membarrier sync code, at least :)
>> 
>> Oh, I should actually say Mathieu recently clarified a return from
>> interrupt doesn't fundamentally need to serialize in order to support
>> membarrier sync core.
> 
> Clarification to your statement:
> 
> Return from interrupt to kernel code does not need to be context serializing
> as long as kernel serializes before returning to user-space.
> 
> However, return from interrupt to user-space needs to be context serializing.

Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb
in the right places.

A kernel thread does a use_mm, then it blocks and the user process with
the same mm runs on that CPU, and then it calls into the kernel, blocks,
the kernel thread runs again, another CPU issues a membarrier which does
not IPI this one because it's running a kthread, and then the kthread
switches back to the user process (still without having unused the mm),
and then the user process returns from syscall without having done a 
core synchronising instruction.

The cause of the problem is you want to avoid IPI'ing kthreads. Why?
I'm guessing it really only matters as an optimisation in case of idle
threads. Idle thread is easy (well, easier) because it won't use_mm, so 
you could check for rq->curr == rq->idle in your loop (in a suitable 
sched accessor function).
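
Something like the following inside membarrier's CPU scan is what that check
could look like (a kernel-flavoured sketch modelled on the existing
private-expedited loop; cpu_rq(), mm, tmpmask and the locking around the loop
are assumed from the surrounding code, and as said above the rq access would
want to sit behind a proper sched accessor):

    /* Skip only genuinely idle CPUs instead of skipping every kthread. */
    for_each_online_cpu(cpu) {
        struct task_struct *p;

        rcu_read_lock();
        p = rcu_dereference(cpu_rq(cpu)->curr);
        if (p == cpu_rq(cpu)->idle) {
            /* The idle task never does use_mm(), so the IPI can be skipped. */
            rcu_read_unlock();
            continue;
        }
        if (p && p->mm == mm)
            __cpumask_set_cpu(cpu, tmpmask);
        rcu_read_unlock();
    }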

But... I'm not really liking this subtlety in the scheduler for all this 
(the scheduler still needs the barriers when switching out of idle).

Can it be improved somehow? Let me forget x86 core sync problem for now
(that _may_ be a bit harder), and step back and look at what we're doing.
The memory barrier case would actually suffer from the same problem as
core sync, because in the same situation it has no implicit mmdrop in
the scheduler switch code either.

So what are we doing with membarrier? We want any activity caused by the 
set of CPUs/threads specified that can be observed by this thread before 
calling membarrier is appropriately fenced from activity that can be 
observed to happen after the call returns.

CPU0                     CPU1
                         1. user stuff
a. membarrier()          2. enter kernel
b. read rq->curr         3. rq->curr switched to kthread
c. is kthread, skip IPI  4. switch_to kthread
d. return to user        5. rq->curr switched to user thread
                         6. switch_to user thread
                         7. exit kernel
                         8. more user stuff

As far as I can see, the problem is CPU1 might reorder step 5 and step
8, so you have mmdrop of lazy mm be a mb after step 6.

But why? The membarrier call only cares that there is a full barrier
between 1 and 8, right? Which it will get from the previous context
switch to the kthread.

I must say the memory barrier comments in membarrier could be improved
a bit (unless I'm missing where the main comment is). It's fine to know
what barriers pair with one another, but we need to know which exact
memory accesses it is ordering

   /*
    * Matches memory barriers around rq->curr modification in
    * scheduler.
    */

Sure, but it doesn't say what else is being ordered. I think it's just
the user memory accesses, but would be nice to make that a bit more
explicit. If we had such comments then we might know this case is safe.

I think the funny powerpc barrier is a similar case of this. If we
ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that
CPU has or will have issued a memory barrier between running user
code.

So AFAIKS all this membarrier stuff in kernel/sched/core.c could
just go away. Except x86 because thread switch doesn't imply core
sync, so CPU1 between 1 and 8 may never issue a core sync instruction
the same way a context switch must be a full mb.

Before getting to x86 -- Am I right, or way off track here? 

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-13 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of July 14, 2020 1:48 am:
> On Mon, Jul 13, 2020 at 7:13 AM Mathieu Desnoyers
>  wrote:
>>
>> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
>>
>> > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>> >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>> >>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>> >>> serialize.  There are older kernels for which it does not promise to
>> >>> serialize.  And I have plans to make it stop serializing in the
>> >>> nearish future.
>> >>
>> >> You mean x86's return from interrupt? Sounds fun... you'll know where to
>> >> update the membarrier sync code, at least :)
>> >
>> > Oh, I should actually say Mathieu recently clarified a return from
>> > interrupt doesn't fundamentally need to serialize in order to support
>> > membarrier sync core.
>>
>> Clarification to your statement:
>>
>> Return from interrupt to kernel code does not need to be context serializing
>> as long as kernel serializes before returning to user-space.
>>
>> However, return from interrupt to user-space needs to be context serializing.
>>
> 
> Indeed, and I figured this out on the first read through because I'm
> quite familiar with the x86 entry code.  But Nick somehow missed this,
> and Nick is the one who wrote the patch.
> 
> Nick, I think this helps prove my point.  The code you're submitting
> may well be correct, but it's unmaintainable.

It's not. The patch I wrote for x86 is a no-op, it just moves existing
x86 hook and code that's already there to a different name.

Actually it's not quite a no-op, it changes it to use hooks that are
actually called in the right places. Because previously it was
unmaintainable from the point of view of generic mm -- it was not clear at
all that the old one should have been called in other places where the
mm goes non-lazy. Now with the exit_lazy_tlb hook, it can quite easily
be spotted where it is missing.

And x86 keeps their membarrier code in x86, and uses nice well defined
lazy tlb mm hooks.

> At the very least, this
> needs a comment explaining, from the perspective of x86, *exactly*
> what exit_lazy_tlb() is promising, why it's promising it, how it
> achieves that promise, and what code cares about it.  Or we could do
> something with TIF flags and make this all less magical, although that
> will probably end up very slightly slower.

It's all documented there in existing comments plus the asm-generic
exit_lazy_tlb specification added AFAIKS.

Is the membarrier comment in finish_task_switch plus these ones not
enough?

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-13 Thread Andy Lutomirski
On Mon, Jul 13, 2020 at 7:13 AM Mathieu Desnoyers
 wrote:
>
> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
>
> > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
> >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
> >>> Also, as it stands, I can easily see in_irq() ceasing to promise to
> >>> serialize.  There are older kernels for which it does not promise to
> >>> serialize.  And I have plans to make it stop serializing in the
> >>> nearish future.
> >>
> >> You mean x86's return from interrupt? Sounds fun... you'll know where to
> >> update the membarrier sync code, at least :)
> >
> > Oh, I should actually say Mathieu recently clarified a return from
> > interrupt doesn't fundamentally need to serialize in order to support
> > membarrier sync core.
>
> Clarification to your statement:
>
> Return from interrupt to kernel code does not need to be context serializing
> as long as kernel serializes before returning to user-space.
>
> However, return from interrupt to user-space needs to be context serializing.
>

Indeed, and I figured this out on the first read through because I'm
quite familiar with the x86 entry code.  But Nick somehow missed this,
and Nick is the one who wrote the patch.

Nick, I think this helps prove my point.  The code you're submitting
may well be correct, but it's unmaintainable.  At the very least, this
needs a comment explaining, from the perspective of x86, *exactly*
what exit_lazy_tlb() is promising, why it's promising it, how it
achieves that promise, and what code cares about it.  Or we could do
something with TIF flags and make this all less magical, although that
will probably end up very slightly slower.

--Andy


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-13 Thread Mathieu Desnoyers
- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>> serialize.  There are older kernels for which it does not promise to
>>> serialize.  And I have plans to make it stop serializing in the
>>> nearish future.
>> 
>> You mean x86's return from interrupt? Sounds fun... you'll know where to
>> update the membarrier sync code, at least :)
> 
> Oh, I should actually say Mathieu recently clarified a return from
> interrupt doesn't fundamentally need to serialize in order to support
> membarrier sync core.

Clarification to your statement:

Return from interrupt to kernel code does not need to be context serializing
as long as kernel serializes before returning to user-space.

However, return from interrupt to user-space needs to be context serializing.

Thanks,

Mathieu

> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html
> 
> So you may not need to do anything more if you relaxed it.
> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-13 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>> serialize.  There are older kernels for which it does not promise to
>> serialize.  And I have plans to make it stop serializing in the
>> nearish future.
> 
> You mean x86's return from interrupt? Sounds fun... you'll know where to 
> update the membarrier sync code, at least :)

Oh, I should actually say Mathieu recently clarified a return from
interrupt doesn't fundamentally need to serialize in order to support
membarrier sync core.

https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html

So you may not need to do anything more if you relaxed it.

Thanks,
Nick


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-12 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>>
>> And get rid of the generic sync_core_before_usermode facility.
>>
>> This helper is the wrong way around I think. The idea that membarrier
>> state requires a core sync before returning to user is the easy one
>> that does not need hiding behind membarrier calls. The gap in core
>> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
>> tricky detail that is better put in x86 lazy tlb code.
>>
>> Consider if an arch did not synchronize core in switch_mm either, then
>> membarrier_mm_sync_core_before_usermode would be in the wrong place
>> but arch specific mmu context functions would still be the right place.
>> There is also an exit_lazy_tlb case that is not covered by this call, which
>> could be a bug (a kthread uses the membarrier process's mm via use_mm, then
>> context switches back to the process without switching mm or a lazy mm switch).
>>
>> This makes lazy tlb code a bit more modular.
> 
> The mm-switching and TLB-management has often had the regrettable
> property that it gets wired up in a way that seems to work at the time
> but doesn't have clear semantics, and I'm a bit concerned that this
> patch is in that category.

It's much more explicit in the core code about where hooks are called
after this patch. And then the x86 membarrier implementation details
are contained to the x86 code where they belong, and we don't have the
previous hook with unclear semantics missing from core code.

> If I'm understanding right, you're trying
> to enforce the property that exiting lazy TLB mode will promise to
> sync the core eventually.  But this has all kinds of odd properties:
> 
>  - Why is exit_lazy_tlb() getting called at all in the relevant cases?

It's a property of how MEMBARRIER_SYNC_CORE is implemented by arch/x86,
see the membarrier comment in finish_task_switch (for analogous reason).

>  When is it permissible to call it?

Comment for the asm-generic code says it's to be called when the lazy
active mm becomes non-lazy.

> I look at your new code and see:
> 
>> +/*
>> + * Ensure that a core serializing instruction is issued before returning
>> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
>> + * user-space through sysexit, sysretl, and sysretq, which are not core
>> + * serializing.
>> + *
>> + * See the membarrier comment in finish_task_switch as to why this is done
>> + * in exit_lazy_tlb.
>> + */
>> +#define exit_lazy_tlb exit_lazy_tlb
>> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>> +{
>> +        /* Switching mm is serializing with write_cr3 */
>> +        if (tsk->mm != mm)
>> +                return;
> 
> And my brain says WTF?  Surely you meant something like if
> (WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened?  how do
> we try to recover well enough to get a crashed logged at least? */ }

No, the active mm can be unlazied by switching to a different mm.

> So this needs actual documentation, preferably in comments near the
> function, of what the preconditions are and what this mm parameter is.
> Once that's done, then we could consider whether it's appropriate to
> have this function promise to sync the core under some conditions.

It's documented in generic code. I prefer not to duplicate comments
too much but I can add a "refer to asm-generic version for usage" or
something if you'd like?

>  - This whole structure seems to rely on the idea that switching mm
> syncs something.

Which whole structure? The x86 implementation of sync core explicitly
does rely on that, yes. But I've pulled that out of core code with
this patch.

> I periodically ask chip vendor for non-serializing
> mm switches.  Specifically, in my dream world, we have totally
> separate user and kernel page tables.  Changing out the user tables
> doesn't serialize or even create a fence.  Instead it creates the
> minimum required pipeline hazard such that user memory access and
> switches to usermode will make sure they access memory through the
> correct page tables.  I haven't convinced a chip vendor yet, but there
> are quite a few hundreds of cycles to be saved here.

The fundamental difficulty is that the kernel can still access user
mappings any time after the switch. We can probably handwave ways
around it by serializing lazily when encountering the next user
access and hoping that most of your mm switches result in a kernel
exit that serializes, or some other unavoidable serialization, so you
can avoid the one at the mm switch. In practice it sounds like a lot of trouble.
But anyway the sync core could presumably be adjusted or reasoned to
still be correct, depending on how it works.

> With this in
> mind, I see the fencing aspects of the TLB handling code as somewhat
> of an accident.  I'm fine with documenting them and using them to
> optimize other paths, but I think it should be explicit.  For example:
> 
> 

Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-10 Thread Andy Lutomirski
On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>
> And get rid of the generic sync_core_before_usermode facility.
>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
>
> Consider if an arch did not synchronize the core in switch_mm either; then
> membarrier_mm_sync_core_before_usermode would be in the wrong place,
> but arch-specific mmu context functions would still be the right place.
> There is also an exit_lazy_tlb case that is not covered by this call, which
> could be a bug (a kthread uses the membarrier process's mm, then context
> switches back to the process without an mm switch or a lazy mm switch).
>
> This makes lazy tlb code a bit more modular.

The mm-switching and TLB-management code has often had the regrettable
property that it gets wired up in a way that seems to work at the time
but doesn't have clear semantics, and I'm a bit concerned that this
patch is in that category.  If I'm understanding right, you're trying
to enforce the property that exiting lazy TLB mode will promise to
sync the core eventually.  But this has all kinds of odd properties:

 - Why is exit_lazy_tlb() getting called at all in the relevant cases?
 When is it permissible to call it?  I look at your new code and see:

> +/*
> + * Ensure that a core serializing instruction is issued before returning
> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
> + * user-space through sysexit, sysretl, and sysretq, which are not core
> + * serializing.
> + *
> + * See the membarrier comment in finish_task_switch as to why this is done
> + * in exit_lazy_tlb.
> + */
> +#define exit_lazy_tlb exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> +{
> +        /* Switching mm is serializing with write_cr3 */
> +        if (tsk->mm != mm)
> +                return;

And my brain says WTF?  Surely you meant something like if
(WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened?  how do
we try to recover well enough to get a crash logged at least? */ }

So this needs actual documentation, preferably in comments near the
function, of what the preconditions are and what this mm parameter is.
Once that's done, then we could consider whether it's appropriate to
have this function promise to sync the core under some conditions.

 - This whole structure seems to rely on the idea that switching mm
syncs something.  I periodically ask chip vendors for non-serializing
mm switches.  Specifically, in my dream world, we have totally
separate user and kernel page tables.  Changing out the user tables
doesn't serialize or even create a fence.  Instead it creates the
minimum required pipeline hazard such that user memory access and
switches to usermode will make sure they access memory through the
correct page tables.  I haven't convinced a chip vendor yet, but there
are quite a few hundred cycles to be saved here.  With this in
mind, I see the fencing aspects of the TLB handling code as somewhat
of an accident.  I'm fine with documenting them and using them to
optimize other paths, but I think it should be explicit.  For example:

/* Also does a full barrier?  (Or a sync_core()-style barrier.)
However, if you rely on this, you must document it in a comment where
you call this function. */
void switch_mm_irqs_off()
{
}

This is kind of like how we strongly encourage anyone using smp_?mb()
to document what they are fencing against.
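
As a purely illustrative example of that convention (names made up, not code
from this thread), each barrier's comment states what it pairs with:

#include <linux/compiler.h>     /* READ_ONCE()/WRITE_ONCE() */
#include <asm/barrier.h>        /* smp_wmb()/smp_rmb() */

static int sketch_data;
static int sketch_flag;

static void sketch_producer(void)
{
        WRITE_ONCE(sketch_data, 42);
        smp_wmb();      /* pairs with the smp_rmb() in sketch_consumer() */
        WRITE_ONCE(sketch_flag, 1);
}

static int sketch_consumer(void)
{
        if (!READ_ONCE(sketch_flag))
                return -1;
        smp_rmb();      /* pairs with the smp_wmb() in sketch_producer() */
        return READ_ONCE(sketch_data);
}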

Also, as it stands, I can easily see in_irq() ceasing to promise to
serialize.  There are older kernels for which it does not promise to
serialize.  And I have plans to make it stop serializing in the
nearish future.

--Andy


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-10 Thread Mathieu Desnoyers
- On Jul 9, 2020, at 9:56 PM, Nicholas Piggin npig...@gmail.com wrote:

> And get rid of the generic sync_core_before_usermode facility.
> 
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode is the
> tricky detail that is better put in x86 lazy tlb code.
> 
> Consider if an arch did not synchronize the core in switch_mm either; then
> membarrier_mm_sync_core_before_usermode would be in the wrong place,
> but arch-specific mmu context functions would still be the right place.
> There is also an exit_lazy_tlb case that is not covered by this call, which
> could be a bug (a kthread uses the membarrier process's mm, then context
> switches back to the process without an mm switch or a lazy mm switch).
> 
> This makes lazy tlb code a bit more modular.

I agree that moving this logic to exit_lazy_tlb is much more modular and
cleaner.

Thanks,

Mathieu


> 
> Signed-off-by: Nicholas Piggin 
> ---
> .../membarrier-sync-core/arch-support.txt |  6 +++-
> arch/x86/include/asm/mmu_context.h| 35 +++
> arch/x86/include/asm/sync_core.h  | 28 ---
> include/linux/sched/mm.h  | 14 
> include/linux/sync_core.h | 21 ---
> kernel/cpu.c  |  4 ++-
> kernel/kthread.c  |  2 +-
> kernel/sched/core.c   | 16 -
> 8 files changed, 51 insertions(+), 75 deletions(-)
> delete mode 100644 arch/x86/include/asm/sync_core.h
> delete mode 100644 include/linux/sync_core.h
> 
> diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> index 52ad74a25f54..bd43fb1f5986 100644
> --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
> @@ -5,6 +5,10 @@
> #
> # Architecture requirements
> #
> +# If your architecture returns to user-space through non-core-serializing
> +# instructions, you need to ensure these are done in switch_mm and exit_lazy_tlb
> +# (if lazy tlb switching is implemented).
> +#
> # * arm/arm64/powerpc
> #
> # Rely on implicit context synchronization as a result of exception return
> @@ -24,7 +28,7 @@
> # instead on write_cr3() performed by switch_mm() to provide core serialization
> # after changing the current mm, and deal with the special case of kthread ->
> # uthread (temporarily keeping current mm into active_mm) by issuing a
> -# sync_core_before_usermode() in that specific case.
> +# serializing instruction in exit_lazy_tlb() in that specific case.
> #
> ---
> | arch |status|
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 255750548433..5263863a9be8 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -6,6 +6,7 @@
> #include 
> #include 
> #include 
> +#include 
> 
> #include 
> 
> @@ -95,6 +96,40 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
> 
> +#ifdef CONFIG_MEMBARRIER
> +/*
> + * Ensure that a core serializing instruction is issued before returning
> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
> + * user-space through sysexit, sysretl, and sysretq, which are not core
> + * serializing.
> + *
> + * See the membarrier comment in finish_task_switch as to why this is done
> + * in exit_lazy_tlb.
> + */
> +#define exit_lazy_tlb exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> +{
> +        /* Switching mm is serializing with write_cr3 */
> +        if (tsk->mm != mm)
> +                return;
> +
> +        if (likely(!(atomic_read(&mm->membarrier_state) &
> +                     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
> +                return;
> +
> +        /* With PTI, we unconditionally serialize before running user code. */
> +        if (static_cpu_has(X86_FEATURE_PTI))
> +                return;
> +        /*
> +         * Return from interrupt and NMI is done through iret, which is core
> +         * serializing.
> +         */
> +        if (in_irq() || in_nmi())
> +                return;
> +        sync_core();
> +}
> +#endif
> +
> /*
>  * Init a new mm.  Used on mm copies, like at fork()
>  * and on mm's that are brand-new, like at execve().
> diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
> deleted file mode 100644
> index c67caafd3381..
> --- a/arch/x86/include/asm/sync_core.h
> +++ /dev/null
> @@ -1,28 +0,0 @@
> -/* 

Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-10 Thread Peter Zijlstra
On Fri, Jul 10, 2020 at 11:56:43AM +1000, Nicholas Piggin wrote:
> And get rid of the generic sync_core_before_usermode facility.
> 
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode is the
> tricky detail that is better put in x86 lazy tlb code.
> 
> Consider if an arch did not synchronize the core in switch_mm either; then
> membarrier_mm_sync_core_before_usermode would be in the wrong place,
> but arch-specific mmu context functions would still be the right place.
> There is also an exit_lazy_tlb case that is not covered by this call, which
> could be a bug (a kthread uses the membarrier process's mm, then context
> switches back to the process without an mm switch or a lazy mm switch).
> 
> This makes lazy tlb code a bit more modular.

Hurmph, I know I've been staring at this at some point. I think I meant
to have a TIF to force the IRET path in the case of MEMBAR_SYNC_CORE.
But I was discouraged by amluto.
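
A rough sketch of that TIF idea, for illustration only -- the flag name and
number, the IPI handler, and the exit-path hook are made up, not existing
kernel interfaces: the membarrier SYNC_CORE IPI would mark the interrupted
task, and the return-to-user path would then take the core-serializing iret
path instead of sysret/sysexit.

#include <linux/thread_info.h>

#define TIF_SYNC_CORE   25      /* hypothetical flag, number chosen arbitrarily */

/*
 * Runs on the target CPU as the membarrier IPI handler; current is the
 * task that must core-serialize before returning to user-mode.
 */
static void sketch_membarrier_sync_core_ipi(void *info)
{
        set_thread_flag(TIF_SYNC_CORE);
}

/* Hypothetical hook in the exit-to-user slow path. */
static void sketch_prepare_exit_to_usermode(void)
{
        if (test_and_clear_thread_flag(TIF_SYNC_CORE)) {
                /*
                 * Arrange to return via iret (core serializing) rather than
                 * sysret/sysexit; how that is forced is left out of this
                 * sketch.
                 */
        }
}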