Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-15 Thread Christopher Lameter
On Fri, 15 Dec 2017, Mathieu Desnoyers wrote:

> Another aspect that worries me is applications using the gs segment selector
> for other purposes. Suddenly reserving the gs segment selector for use by a
> library like glibc may lead to incompatibilities with applications already
> using it.

fs/gs seems to be reserved for thread local storage. So it would
be shared in user space like the corresponding cpu segment register in
kernel space where multiple subsystems share %gs.

The same can be done in user space. Ulrich Drepper has a writeup on this:

https://www.akkadia.org/drepper/tls.pdf

Savings in execution time could come about because there would not be the
need to determine the address of the processor specific memory area in
each restartable sequence and there would be memory free of contention for
such a sequence in order f.e. to realize fast counters.



Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-15 Thread Mathieu Desnoyers
- On Dec 15, 2017, at 10:05 AM, Chris Lameter c...@linux.com wrote:

> On Thu, 14 Dec 2017, Peter Zijlstra wrote:
> 
>> > But my company has extensive user space code that maintains a lot of
>> > counters and does other tricks to get full performance out of the
>> > hardware. Such a mechanism would also be good from user space. Why keep
>> > the good stuff only inside the kernel?
>>
>> Mathieu's proposal is for userspace, _only_ userspace.
> 
> But what we were talking about are instructions that work effectively in
> kernel space whose efficiency restartable sequences could bring to user
> space.

It can be worthwhile to recap my understanding of this thread so far:

AFAIU, Chris' proposal is to use the "gs" segment selector as instruction
prefix on x86 rather than explicitly loading CPU number and calculating
offsets.

This can turn sequences of rseq operations like this cmpxchg:

Registers:

  R1: return value
  R2: expected value
  R3: new value
  R4: cpu_id

rseq cmpxchg:

  load TLS::cpu_id_start into R4
  calculate offset of v
  fs:mov (store rseq descriptor address into TLS::rseq_cs)
  compare R4 against TLS::cpu_id
  jne abort
  mov (load v into R1)
  compare R1 against R2
  jne cmpfail
  mov (store R3 into *v)

into:

  fs:mov (store rseq descriptor address into TLS::rseq_cs)
  gs:mov (load *v+off into R1)
  compare R1 against R2
  jne cmpfail
  gs:mov (store R3 into *v+off)

My first concern with this approach is the lack of flexibility of the segment
selector method wrt the variety of schemes user-space has to deal with for memory
allocation. In the kernel, this is achieved by ensuring that all per-cpu data
layout is segment-selector-prefix friendly.
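
To make the layout constraint concrete, here is a minimal sketch in C. It is an illustration only, not kernel code: it assumes every CPU owns an identically laid out area, so that a per-cpu variable is fully named by a constant offset. The names percpu_area, PERCPU_OFF and percpu_ptr() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

/* Every CPU gets an identically laid out area (hypothetical layout). */
struct percpu_area {
    long nr_events;
    long nr_errors;
};

static struct percpu_area areas[MODEL_NR_CPUS];

/* A per-cpu variable is named by its offset within the area. */
#define PERCPU_OFF(field) offsetof(struct percpu_area, field)

/*
 * base(cpu) + offset: the addition a gs-prefixed instruction would
 * perform in hardware if gs pointed at the current CPU's area.
 */
static long *percpu_ptr(int cpu, size_t off)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return (long *)((char *)&areas[cpu] + off);
}
```

Data that a user-space allocator places at unrelated addresses for each CPU does not fit this base-plus-constant-offset scheme, which is the flexibility concern raised above.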

Another aspect that worries me is applications using the gs segment selector
for other purposes. Suddenly reserving the gs segment selector for use by a
library like glibc may lead to incompatibilities with applications already
using it.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-15 Thread Christopher Lameter
On Thu, 14 Dec 2017, Peter Zijlstra wrote:

> > But my company has extensive user space code that maintains a lot of
> > counters and does other tricks to get full performance out of the
> > hardware. Such a mechanism would also be good from user space. Why keep
> > the good stuff only inside the kernel?
>
> Mathieu's proposal is for userspace, _only_ userspace.

But what we were talking about are instructions that work effectively in
kernel space whose efficiency restartable sequences could bring to user
space.



Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 03:14:00PM -0600, Christopher Lameter wrote:
> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
> 
> > If we port this concept to kernel-space (as I start to understand
> > would be your wish), then a simple pointer store to the current
> > task_struct would suffice.
> 
> Certainly such a port would be beneficial for non x86 archs.
> 
> But my company has extensive user space code that maintains a lot of
> counters and does other tricks to get full performance out of the
> hardware. Such a mechanism would also be good from user space. Why keep
> the good stuff only inside the kernel?

Mathieu's proposal is for userspace, _only_ userspace.


Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Christopher Lameter
On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:

> If we port this concept to kernel-space (as I start to understand
> would be your wish), then a simple pointer store to the current
> task_struct would suffice.

Certainly such a port would be beneficial for non x86 archs.

But my company has extensive user space code that maintains a lot of
counters and does other tricks to get full performance out of the
hardware. Such a mechanism would also be good from user space. Why keep
the good stuff only inside the kernel?


Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 07:57:08PM +, Mathieu Desnoyers wrote:
> - On Dec 14, 2017, at 2:48 PM, Peter Zijlstra pet...@infradead.org wrote:
> 
> > On Thu, Dec 14, 2017 at 12:50:13PM -0600, Christopher Lameter wrote:
> >> Ultimately I wish fast increments like done by this_cpu_inc() could be
> >> implemented in an efficient way on non x86 platforms that do not have
> >> cheap instructions like that.
> > 
> > So the problem isn't migration; for that we could wrap the operation in
> > preempt_disable() which is not more expensive than rseq would be. And a
> > lot more deterministic.
> > 
> > The problem instead is interrupts, which can result in nested load-store
> > operations, and that comes apart. This then means having to disable
> > interrupts over these things and _that_ is expensive.
> 
> Then could we consider checking a per task-struct rseq_cs pointer when
> returning from interrupt handler ? This rseq_cs pointer would track
> kernel restartable sequences. This would also work for NMI handlers.

I really don't much like making the interrupt handlers more expensive
for this.

And I don't think NMIs are a real worry, you should be very careful when
you share state with any of them in any case.

Also; what you can do is soft interrupt disable, which is effectively
the opposite approach, instead of restarting the sequence, you delay the
interrupt handler. And that has the obvious benefit of making all the
local_irq_disable/enable crud much faster all over.

This is something PowerPC already does.
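
The soft-disable idea can be modelled in a few lines of C. This is a sketch under simplifying assumptions, not how any architecture implements it: interrupts are simulated by a function call, a single pending flag stands in for real bookkeeping (so several interrupts arriving while disabled collapse into one replay), and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; a real kernel would keep this per CPU. */
static bool soft_disabled;
static bool irq_pending;
static int handler_runs;

static void irq_handler(void)
{
    handler_runs++;
}

/* Called when the (simulated) hardware raises an interrupt. */
static void model_raise_irq(void)
{
    if (soft_disabled) {
        irq_pending = true;   /* remember it, run the handler later */
        return;
    }
    irq_handler();
}

/* Disabling is a plain store, no privileged instruction involved. */
static void model_irq_disable(void)
{
    soft_disabled = true;
}

static void model_irq_enable(void)
{
    soft_disabled = false;
    if (irq_pending) {        /* replay what arrived in the meantime */
        irq_pending = false;
        irq_handler();
    }
}
```

In this model the handler is delayed until model_irq_enable() rather than the interrupted sequence being restarted, which is the contrast with rseq drawn above.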



Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Mathieu Desnoyers
- On Dec 14, 2017, at 2:48 PM, Peter Zijlstra pet...@infradead.org wrote:

> On Thu, Dec 14, 2017 at 12:50:13PM -0600, Christopher Lameter wrote:
>> Ultimately I wish fast increments like done by this_cpu_inc() could be
>> implemented in an efficient way on non x86 platforms that do not have
>> cheap instructions like that.
> 
> So the problem isn't migration; for that we could wrap the operation in
> preempt_disable() which is not more expensive than rseq would be. And a
> lot more deterministic.
> 
> The problem instead is interrupts, which can result in nested load-store
> operations, and that comes apart. This then means having to disable
> interrupts over these things and _that_ is expensive.

Then could we consider checking a per task-struct rseq_cs pointer when
returning from interrupt handler ? This rseq_cs pointer would track
kernel restartable sequences. This would also work for NMI handlers.
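
A rough model of that check, written as plain C. Everything here is an assumption made for illustration: the descriptor is a simplified stand-in (not the uapi struct), model_task stands in for task_struct, and model_irq_return_ip() is a hypothetical hook on the interrupt return path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified descriptor of one critical section (model only). */
struct model_rseq_cs {
    uintptr_t start_ip;
    uintptr_t post_commit_offset;  /* length of the critical section */
    uintptr_t abort_ip;
};

struct model_task {
    const struct model_rseq_cs *rseq_cs;  /* NULL when not in a sequence */
};

/*
 * Run on return from a (simulated) interrupt: if the interrupted ip is
 * inside the registered critical section, resume at the abort handler.
 */
static uintptr_t model_irq_return_ip(struct model_task *t, uintptr_t ip)
{
    const struct model_rseq_cs *cs = t->rseq_cs;

    if (cs == NULL)
        return ip;
    t->rseq_cs = NULL;  /* the descriptor is consumed either way */
    /* Unsigned subtraction also rejects ip values below start_ip. */
    if (ip - cs->start_ip < cs->post_commit_offset)
        return cs->abort_ip;
    return ip;
}
```

The cost on the common path, where no sequence is registered, is one load and one compare against NULL.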

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Peter Zijlstra
On Thu, Dec 14, 2017 at 12:50:13PM -0600, Christopher Lameter wrote:
> Ultimately I wish fast increments like done by this_cpu_inc() could be
> implemented in an efficient way on non x86 platforms that do not have
> cheap instructions like that.

So the problem isn't migration; for that we could wrap the operation in
preempt_disable() which is not more expensive than rseq would be. And a
lot more deterministic.

The problem instead is interrupts, which can result in nested load-store
operations, and that comes apart. This then means having to disable
interrupts over these things and _that_ is expensive.
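
The nested load-store problem can be shown with a deterministic simulation in C. This is only a sketch: the "interrupt" is an ordinary function call placed between the load and the store, and the model_* names are hypothetical.

```c
#include <assert.h>

static long counter;

/* Simulated interrupt handler that also increments the counter. */
static void model_irq_increment(void)
{
    counter = counter + 1;
}

/*
 * Load/add/store increment. When irq_between is non-zero, a simulated
 * interrupt fires between the load and the store.
 */
static void model_increment(int irq_between)
{
    long tmp = counter;         /* load */

    if (irq_between)
        model_irq_increment();  /* nested load-store on the same variable */
    tmp = tmp + 1;              /* add */
    counter = tmp;              /* store: overwrites the handler's update */
}
```

Starting from zero, model_increment(1) performs two increments in total but leaves the counter at 1: the handler's update is lost.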



Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Mathieu Desnoyers
- On Dec 14, 2017, at 1:50 PM, Chris Lameter c...@linux.com wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
> 
>> > I think the proper way to think about gs and fs on x86 is as base
>> > registers. They are essentially values in registers added to the address
>> > generated in an instruction. As such the approach is transferable to other
>> > processor architecture. Many support base register and base register
>> > relative processing. If a processor can do RMW instructions base register
>> > relative then you have something similar.
>>
>> How would you do it on ARM32 ?
> 
> Actually you do not really need RMW instructions. The data is cpu specific
> so within a restartable sequence you would have exclusive access right?

Yep.

> 
> F.e. an increment would be
> 
> 1. Load base register relative
> 2. add 1
> 3. Store base register relative

Actually, for the increment case, rseq headers provide an "add" API,
which uses an "add" instruction on x86. On arm 32, it does indeed:

RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
"ldr r0, %[v]\n\t"
"add r0, %[count]\n\t"
/* final store */
"str r0, %[v]\n\t"

> 
> The main overhead would be the registration of the sequence.

Registering a rseq is a single store to a TLS (in user-space),
which really isn't that expensive.

If we port this concept to kernel-space (as I start to understand
would be your wish), then a simple pointer store to the current
task_struct would suffice.

> 
> The advantage on x86 is that you do not need a restartable sequence
> since a single lockless RMW instruction can do this (this_cpu_inc f.e.)

Indeed, on x86, for the specific case of counter increment, the
single-instruction "add" or "inc" with a segment-selector prefix can save
setting up the rseq (a pointer store), and offsetting from a base using the
cpu number.

If your wish is to do this at kernel level, where we have full control over
the gs segment, this makes sense. I'm worried that applying this to user-space
might create conflicts wrt who owns that segment selector register wrt
pre-existing applications.
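
For comparison, the closest thing user space has today is thread-local storage, which on x86-64 Linux is typically addressed relative to fs. The sketch below is an illustration under that assumption: the data is per thread, not per CPU, and whether the increment becomes a single fs-prefixed "add" depends on the compiler, optimization level and TLS model. The names tls_counter and tls_counter_inc() are hypothetical.

```c
#include <assert.h>

/* One copy per thread; GCC and Clang accept __thread as an extension. */
static __thread long tls_counter;

/* Increment this thread's copy and return the new value. */
static long tls_counter_inc(void)
{
    tls_counter++;   /* may compile to one fs-relative add on x86-64 */
    return tls_counter;
}
```

Because each thread has its own copy, no rseq registration and no cpu number lookup is involved; the trade-off is one counter instance per thread instead of one per CPU.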

> 
>> One benefit of your proposal is to lessen the number of retired instructions,
>> but if we take the IPC into account, it is slower than rseq in my benchmark.
>> What benefits do you expect from using segment selectors and
>> non-lock-prefixed atomic instructions on the fast-path ?
> 
> Ultimately I wish fast increments like done by this_cpu_inc() could be
> implemented in an efficient way on non x86 platforms that do not have
> cheap instructions like that.

My understanding is that your focus is mainly on kernel code, right ? Or is
your aim to port this_cpu_inc() to userspace as well ?

Indeed, the concepts behind rseq could be ported to kernel code eventually.
The immediate gain is much higher by exposing this to user-space though,
given that there is no good way to perform per-cpu operations efficiently
at all there, whereas kernel code can always disable preemption.

> 
> If cmpxchg local is slower than a group of instructions to do the same
> then there is an obvious question to the cpu architects why we would need
> the instruction at all (aside from the fact that we do not need a
> restartable sequence for these instructions).

I'm not a specialist in CPU instruction scheduling, so I won't speculate
on this topic. ;-)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Christopher Lameter
On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:

> > I think the proper way to think about gs and fs on x86 is as base
> > registers. They are essentially values in registers added to the address
> > generated in an instruction. As such the approach is transferable to other
> > processor architecture. Many support base register and base register
> > relative processing. If a processor can do RMW instructions base register
> > relative then you have something similar.
>
> How would you do it on ARM32 ?

Actually you do not really need RMW instructions. The data is cpu specific
so within a restartable sequence you would have exclusive access right?

F.e. an increment would be

1. Load base register relative
2. add 1
3. Store base register relative

The main overhead would be the registration of the sequence.

The advantage on x86 is that you do not need a restartable sequence
since a single lockless RMW instruction can do this (this_cpu_inc f.e.)

> One benefit of your proposal is to lessen the number of retired instructions,
> but if we take the IPC into account, it is slower than rseq in my benchmark.
> What benefits do you expect from using segment selectors and
> non-lock-prefixed atomic instructions on the fast-path ?

Ultimately I wish fast increments like done by this_cpu_inc() could be
implemented in an efficient way on non x86 platforms that do not have
cheap instructions like that.

If cmpxchg local is slower than a group of instructions to do the same
then there is an obvious question to the cpu architects why we would need
the instruction at all (aside from the fact that we do not need a
restartable sequence for these instructions).


Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Mathieu Desnoyers
- On Dec 14, 2017, at 11:44 AM, Chris Lameter c...@linux.com wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
> 
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the cpu id cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
> 
> I think the proper way to think about gs and fs on x86 is as base
> registers. They are essentially values in registers added to the address
> generated in an instruction. As such the approach is transferable to other
> processor architecture. Many support base register and base register
> relative processing. If a processor can do RMW instructions base register
> relative then you have something similar.

How would you do it on ARM32 ?

> 
> In a restartable sequence you could increase efficiency by avoiding full
> atomic instructions. This would be similar to the lockless RMW available
> on x86 then. And in that form it is portable.
> 
> A context switch to another processor would mean that the value of the
> base register has changed and that we therefore are accessing another per
> cpu segment. Restarting the sequence will yield a correct result without
> any reloading of registers.

As a concrete example, let's try to apply your proposal on a common use-case:
a compare-and-store on user-space per-cpu data.

With my rseq proposal the fast-path pseudo-code boils down to:

load TLS::cpu_id_start into reg_X
add reg_X offset to base to find target v
store pointer to TLS::rseq_cs
compare reg_X against TLS::cpu_id
jne abort
cmp *v, value
jne cmpfail
store newval to *v
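
The same steps written out as plain C, purely to make the data flow readable. This is a model, not a working restartable sequence: a real one has to be written in assembly so the kernel can recognize the critical section by address, and here the abort can only be triggered by hand. The struct fields mirror the names in the pseudo-code; model_tls and model_compare_and_store() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Fields named after the pseudo-code; a model, not the kernel ABI. */
struct model_tls {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    const void *rseq_cs;
};

enum model_result { MODEL_OK = 0, MODEL_CMPFAIL = 1, MODEL_ABORT = -1 };

static enum model_result
model_compare_and_store(struct model_tls *tls, long *percpu_v,
                        long expect, long newval, const void *cs_descriptor)
{
    uint32_t cpu = tls->cpu_id_start;   /* load TLS::cpu_id_start */
    long *v;

    assert(cpu < MODEL_NR_CPUS);
    v = &percpu_v[cpu];                 /* offset from base to target v */
    tls->rseq_cs = cs_descriptor;       /* store pointer to TLS::rseq_cs */
    if (cpu != tls->cpu_id)             /* compare against TLS::cpu_id */
        return MODEL_ABORT;
    if (*v != expect)                   /* cmp *v, value */
        return MODEL_CMPFAIL;
    *v = newval;                        /* final store (commit) */
    return MODEL_OK;
}
```

A caller would typically retry on MODEL_ABORT after re-reading the cpu number, and treat MODEL_CMPFAIL as a normal compare failure.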

My benchmark on Intel x86-64 E5-2630 shows that it takes 1.9 ns/iteration
for a test-case incrementing a counter with this rseq compare-and-store
sequence.

Let's assume we can reserve the gs segment selector for use in user-space,
and that the per-cpu data layout allows using this segment selector as offset.
The compare-and-store use-case would require a "cmpxchg" instruction with
a gs segment selector.

With a single-threaded test-case which uses non-lock-prefixed cmpxchg in a loop
on an E5-2630, I get 2.8 ns/iteration (no per-cpu data involved, done on a
single global value).
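
A test of this kind could look roughly like the sketch below. It is not the benchmark used for the numbers above and makes no claim about reproducing them: the harness, the iteration loop and the names cmpxchg_local_sketch() and bench_cmpxchg_increment() are all assumptions. The inline assembly is GCC/Clang syntax for x86-64; on other targets the code falls back to plain C so that it still builds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/*
 * Compare-and-exchange without the lock prefix. Only safe against code
 * running on the same CPU. Falls back to plain C off x86-64.
 */
static long cmpxchg_local_sketch(long *v, long expect, long newv)
{
#if defined(__x86_64__) && defined(__GNUC__)
    long prev = expect;

    __asm__ __volatile__("cmpxchgq %2, %1"
                         : "+a"(prev), "+m"(*v)
                         : "r"(newv)
                         : "cc");
    return prev;  /* old value of *v; equals expect on success */
#else
    long prev = *v;

    if (prev == expect)
        *v = newv;
    return prev;
#endif
}

/* Increment *v iters times via cmpxchg; returns average ns per iteration. */
static double bench_cmpxchg_increment(long *v, long iters)
{
    struct timespec t0, t1;
    long i;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++) {
        long old = *v;

        while (cmpxchg_local_sketch(v, old, old + 1) != old)
            old = *v;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

The result depends heavily on the CPU, the compiler flags and the loop overhead, so any figure it prints is only meaningful relative to other variants measured with the same harness.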

One benefit of your proposal is to lessen the number of retired instructions,
but if we take the IPC into account, it is slower than rseq in my benchmark.
What benefits do you expect from using segment selectors and
non-lock-prefixed atomic instructions on the fast-path ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

2017-12-14 Thread Christopher Lameter
On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:

> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the cpu id cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.

I think the proper way to think about gs and fs on x86 is as base
registers. They are essentially values in registers added to the address
generated in an instruction. As such the approach is transferable to other
processor architectures. Many support base registers and base-register-relative
addressing. If a processor can do RMW (read-modify-write) instructions relative
to a base register then you have something similar.

In a restartable sequence you could increase efficiency by avoiding full
atomic instructions. This would be similar to the lockless RMW available
on x86 then. And in that form it is portable.

A context switch to another processor would mean that the value of the
base register has changed and that we are therefore accessing another
per-cpu segment. Restarting the sequence will yield a correct result without
any reloading of registers.
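
To make the base-register view concrete, here is a small C model of a per-cpu counter update addressed as base plus offset. Everything in it is an assumption made for illustration: struct percpu_area, MODEL_NR_CPUS, percpu_base() and percpu_add() are hypothetical names, and the base is passed explicitly instead of living in gs. On x86-64, with the gs base pointing at the current CPU's area, the body of percpu_add() could in principle be a single non-lock-prefixed add with a gs override. In plain C it is a separate load, add and store, and it is not restartable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4  /* hypothetical CPU count for the model */

/* Hypothetical per-cpu data layout; every CPU owns one instance. */
struct percpu_area {
    uint64_t events;
    uint64_t bytes;
};

static struct percpu_area percpu_areas[MODEL_NR_CPUS];

/* Model of "the value in the base register" for a given CPU. */
static void *percpu_base(unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    return &percpu_areas[cpu];
}

/*
 * Read-modify-write relative to a base: the same offset names the same
 * field on every CPU, only the base differs. If the thread migrates,
 * redoing this with the new base touches the new CPU's area.
 */
static void percpu_add(void *base, size_t offset, uint64_t delta)
{
    uint64_t *slot = (uint64_t *)((char *)base + offset);
    *slot += delta;
}
```

A fast counter would then be percpu_add(base, offsetof(struct percpu_area, events), 1), with a reader summing the field over all per-cpu areas.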




