Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Fri, 15 Dec 2017, Mathieu Desnoyers wrote:

> Another aspect that worries me is applications using the gs segment selector
> for other purposes. Suddenly reserving the gs segment selector for use by a
> library like glibc may lead to incompatibilities with applications already
> using it.

fs/gs seems to be reserved for thread local storage. So it would be shared
in user space like the corresponding cpu segment register in kernel space
where multiple subsystems share %gs. The same can be done in user space.

Ulrich Drepper has a writeup on this:

https://www.akkadia.org/drepper/tls.pdf

Savings in execution time could come about because there would not be the
need to determine the address of the processor specific memory area in
each restartable sequence and there would be memory free of contention for
such a sequence in order f.e. to realize fast counters.
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
----- On Dec 15, 2017, at 10:05 AM, Chris Lameter c...@linux.com wrote:

> On Thu, 14 Dec 2017, Peter Zijlstra wrote:
>
>> > But my company has extensive user space code that maintains a lot of
>> > counters and does other tricks to get full performance out of the
>> > hardware. Such a mechanism would also be good from user space. Why keep
>> > the good stuff only inside the kernel?
>>
>> Mathieu's proposal is for userspace, _only_ userspace.
>
> But what we were talking about are instructions that work effectively in
> kernel space whose efficiency restartable sequences could bring to user
> space.

It can be worthwhile to recap my understanding of this thread so far:
AFAIU, Chris' proposal is to use the "gs" segment selector as instruction
prefix on x86 rather than explicitly loading CPU number and calculating
offsets.

This can turn sequences of rseq operations like this cmpxchg:

Registers:
  R1: return value
  R2: expected value
  R3: new value
  R4: cpu_id

rseq cmpxchg:

  load TLS::cpu_id_start into R4
  calculate offset of v
  fs:mov (store rseq descriptor address into TLS::rseq_cs)
  compare R4 against TLS::cpu_id
  jne abort
  mov (load v into R1)
  compare R1 against R2
  jne cmpfail
  mov (store R3 into *v)

into:

  fs:mov (store rseq descriptor address into TLS::rseq_cs)
  gs:mov (load *v+off into R1)
  compare R1 against R2
  jne cmpfail
  gs:mov (store R3 into *v+off)

My first concern with this approach is the lack of flexibility of the
segment selector method wrt variety of schemes user-space has to deal
with for memory allocation. In the kernel, this is achieved by ensuring
that all per-cpu data layout is segment-selector-prefix friendly.

Another aspect that worries me is applications using the gs segment selector
for other purposes. Suddenly reserving the gs segment selector for use by a
library like glibc may lead to incompatibilities with applications already
using it.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
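For readers who prefer C to pseudo-assembly, the two sequences recapped in this message can be modelled roughly as follows. This is only a sketch of the control flow under loud assumptions: `model_tls`, `percpu_slot`, `model_cmpxchg_cpu_id` and `model_cmpxchg_gs` are hypothetical names, the store of the rseq descriptor is left out, and plain C offers none of the restart guarantees discussed in the thread, so nothing here is actually restartable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for the TLS fields the kernel would keep up to date. */
struct model_tls {
    uint32_t cpu_id_start; /* CPU number read at the start of the sequence */
    uint32_t cpu_id;       /* CPU number the thread is running on now */
};

/* One slot per CPU, padded so that two CPUs never share a cache line. */
struct percpu_slot {
    intptr_t v;
    char pad[64 - sizeof(intptr_t)];
};

enum cmpxchg_result { CMPXCHG_OK = 0, CMPXCHG_CMPFAIL = 1, CMPXCHG_ABORT = 2 };

/* First form: explicit CPU number load, offset computation and CPU check. */
static enum cmpxchg_result
model_cmpxchg_cpu_id(const struct model_tls *tls, struct percpu_slot *base,
                     intptr_t expect, intptr_t newv, intptr_t *old)
{
    uint32_t cpu = tls->cpu_id_start;   /* load TLS::cpu_id_start into R4 */
    assert(cpu < MODEL_NR_CPUS);
    intptr_t *v = &base[cpu].v;         /* calculate offset of v */
    if (cpu != tls->cpu_id)             /* compare R4 against TLS::cpu_id */
        return CMPXCHG_ABORT;           /* jne abort */
    *old = *v;                          /* load v into R1 */
    if (*old != expect)                 /* compare R1 against R2 */
        return CMPXCHG_CMPFAIL;         /* jne cmpfail */
    *v = newv;                          /* store R3 into *v */
    return CMPXCHG_OK;
}

/*
 * Second form: the caller passes a base that already points at the current
 * CPU's area (the role the gs prefix would play), so the CPU number load,
 * the offset computation and the CPU check all disappear.
 */
static enum cmpxchg_result
model_cmpxchg_gs(struct percpu_slot *gs_base, intptr_t expect, intptr_t newv,
                 intptr_t *old)
{
    assert(gs_base != NULL);
    *old = gs_base->v;                  /* gs:mov (load) */
    if (*old != expect)
        return CMPXCHG_CMPFAIL;         /* jne cmpfail */
    gs_base->v = newv;                  /* gs:mov (store) */
    return CMPXCHG_OK;
}
```

In this model the second form is shorter only because the caller has already resolved the per-CPU base, which is where the memory-layout concern raised in the message comes in.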
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Thu, 14 Dec 2017, Peter Zijlstra wrote:

> > But my company has extensive user space code that maintains a lot of
> > counters and does other tricks to get full performance out of the
> > hardware. Such a mechanism would also be good from user space. Why keep
> > the good stuff only inside the kernel?
>
> Mathieu's proposal is for userspace, _only_ userspace.

But what we were talking about are instructions that work effectively in
kernel space whose efficiency restartable sequences could bring to user
space.
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Thu, Dec 14, 2017 at 03:14:00PM -0600, Christopher Lameter wrote:
> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
>
> > If we port this concept to kernel-space (as I start to understand
> > would be your wish), then a simple pointer store to the current
> > task_struct would suffice.
>
> Certainly such a port would be beneficial for non x86 archs.
>
> But my company has extensive user space code that maintains a lot of
> counters and does other tricks to get full performance out of the
> hardware. Such a mechanism would also be good from user space. Why keep
> the good stuff only inside the kernel?

Mathieu's proposal is for userspace, _only_ userspace.
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:

> If we port this concept to kernel-space (as I start to understand
> would be your wish), then a simple pointer store to the current
> task_struct would suffice.

Certainly such a port would be beneficial for non x86 archs.

But my company has extensive user space code that maintains a lot of
counters and does other tricks to get full performance out of the
hardware. Such a mechanism would also be good from user space. Why keep
the good stuff only inside the kernel?
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Thu, Dec 14, 2017 at 07:57:08PM +, Mathieu Desnoyers wrote:
> ----- On Dec 14, 2017, at 2:48 PM, Peter Zijlstra pet...@infradead.org wrote:
>
> > On Thu, Dec 14, 2017 at 12:50:13PM -0600, Christopher Lameter wrote:
> >> Ultimately I wish fast increments like done by this_cpu_inc() could be
> >> implemented in an efficient way on non x86 platforms that do not have
> >> cheap instructions like that.
> >
> > So the problem isn't migration; for that we could wrap the operation in
> > preempt_disable() which is not more expensive than rseq would be. And a
> > lot more deterministic.
> >
> > The problem instead is interrupts, which can result in nested load-store
> > operations, and that comes apart. This then means having to disable
> > interrupts over these things and _that_ is expensive.
>
> Then could we consider checking a per task-struct rseq_cs pointer when
> returning from interrupt handler? This rseq_cs pointer would track
> kernel restartable sequences. This would also work for NMI handlers.

I really don't much like making the interrupt handlers more expensive
for this. And I don't think NMIs are a real worry, you should be very
careful when you share state with any of them in any case.

Also; what you can do is soft interrupt disable, which is effectively
the opposite approach: instead of restarting the sequence, you delay the
interrupt handler. And that has the obvious benefit of making all the
local_irq_disable/enable crud much faster all over.

This is something PowerPC already does.
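The soft-disable idea in this message can be illustrated with a small single-threaded C model. Everything in it is hypothetical: `soft_irq_disable`, `soft_irq_enable`, `model_raise_irq` and the handler are invented names, not kernel or PowerPC interfaces. An "interrupt" that arrives while the soft flag is set is only recorded as pending and is replayed when the flag is cleared, so the protected sequence is delayed around rather than restarted.

```c
#include <assert.h>
#include <stdbool.h>

static bool soft_disabled;  /* set while a protected sequence runs */
static bool irq_pending;    /* an interrupt arrived while soft-disabled */
static long counter;        /* state shared between the sequence and the handler */
static int handler_runs;    /* how many times the handler actually ran */

/* Hypothetical handler: touches the same counter as the protected sequence. */
static void model_irq_handler(void)
{
    counter += 100;
    handler_runs++;
}

/* Stand-in for the hardware raising an interrupt. */
static void model_raise_irq(void)
{
    if (soft_disabled) {
        irq_pending = true;     /* delay: only remember that it happened */
        return;
    }
    model_irq_handler();
}

static void soft_irq_disable(void)
{
    soft_disabled = true;       /* a plain store, no hardware masking */
}

static void soft_irq_enable(void)
{
    assert(soft_disabled);
    soft_disabled = false;
    if (irq_pending) {          /* replay the interrupt that was delayed */
        irq_pending = false;
        model_irq_handler();
    }
}

/*
 * Non-atomic increment protected by soft-disable. 'irq_in_middle' injects
 * an interrupt between the load and the store to show that it is deferred.
 */
static void model_counter_inc(bool irq_in_middle)
{
    soft_irq_disable();
    long tmp = counter;         /* load */
    if (irq_in_middle)
        model_raise_irq();      /* recorded as pending, handler not run yet */
    counter = tmp + 1;          /* store */
    soft_irq_enable();          /* handler runs here, after the store */
}
```

This model tracks a single pending flag, so several interrupts arriving during one protected sequence would be coalesced into one replay; a real implementation has to decide how to handle that case.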
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
----- On Dec 14, 2017, at 2:48 PM, Peter Zijlstra pet...@infradead.org wrote:

> On Thu, Dec 14, 2017 at 12:50:13PM -0600, Christopher Lameter wrote:
>> Ultimately I wish fast increments like done by this_cpu_inc() could be
>> implemented in an efficient way on non x86 platforms that do not have
>> cheap instructions like that.
>
> So the problem isn't migration; for that we could wrap the operation in
> preempt_disable() which is not more expensive than rseq would be. And a
> lot more deterministic.
>
> The problem instead is interrupts, which can result in nested load-store
> operations, and that comes apart. This then means having to disable
> interrupts over these things and _that_ is expensive.

Then could we consider checking a per task-struct rseq_cs pointer when
returning from interrupt handler? This rseq_cs pointer would track
kernel restartable sequences. This would also work for NMI handlers.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Thu, Dec 14, 2017 at 12:50:13PM -0600, Christopher Lameter wrote:
> Ultimately I wish fast increments like done by this_cpu_inc() could be
> implemented in an efficient way on non x86 platforms that do not have
> cheap instructions like that.

So the problem isn't migration; for that we could wrap the operation in
preempt_disable() which is not more expensive than rseq would be. And a
lot more deterministic.

The problem instead is interrupts, which can result in nested load-store
operations, and that comes apart. This then means having to disable
interrupts over these things and _that_ is expensive.
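The "nested load-store" problem described in this message can be reproduced deterministically in a single-threaded C model. The names `nested_counter`, `nested_irq_handler` and `nested_inc` are hypothetical, and the "interrupt" is just a callback invoked between the load and the store; the sketch only shows why a three-step increment loses an update when a handler touching the same variable nests inside it.

```c
#include <assert.h>
#include <stddef.h>

static long nested_counter;

/* Hypothetical interrupt handler that increments the same counter. */
static void nested_irq_handler(void)
{
    long tmp = nested_counter;  /* nested load */
    nested_counter = tmp + 1;   /* nested store */
}

/*
 * Three-step increment. 'irq_between', when not NULL, is called between
 * the load and the store and stands in for an interrupt arriving there.
 */
static void nested_inc(void (*irq_between)(void))
{
    long tmp = nested_counter;  /* load */
    if (irq_between != NULL)
        irq_between();          /* the handler's own increment lands here */
    nested_counter = tmp + 1;   /* store overwrites the handler's update */
}

/* Helper for callers that want to check the model's state. */
static long nested_read(void)
{
    assert(nested_counter >= 0);
    return nested_counter;
}
```

Calling `nested_inc(NULL)` and then `nested_inc(nested_irq_handler)` performs three increments in total, yet `nested_read()` reports 2. Disabling preemption does not help with this case, which is why the message singles out interrupts rather than migration.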
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
----- On Dec 14, 2017, at 1:50 PM, Chris Lameter c...@linux.com wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
>
>> > I think the proper way to think about gs and fs on x86 is as base
>> > registers. They are essentially values in registers added to the address
>> > generated in an instruction. As such the approach is transferable to other
>> > processor architectures. Many support base register and base register
>> > relative processing. If a processor can do RMW instructions base register
>> > relative then you have something similar.
>>
>> How would you do it on ARM32?
>
> Actually you do not really need RMW instructions. The data is cpu specific
> so within a restartable sequence you would have exclusive access right?

Yep.

> F.e. an increment would be
>
> 1. Load base register relative
> 2. add 1
> 3. Store base register relative

Actually, for the increment case, rseq headers provide an "add" API, which
uses an "add" instruction on x86. On arm 32, it does indeed:

                RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
                RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
                "ldr r0, %[v]\n\t"
                "add r0, %[count]\n\t"
                /* final store */
                "str r0, %[v]\n\t"

> The main overhead would be the registration of the sequence.

Registering a rseq is a single store to a TLS (in user-space), which
really isn't that expensive. If we port this concept to kernel-space
(as I start to understand would be your wish), then a simple pointer
store to the current task_struct would suffice.

> The advantage on x86 is that you do not need a restartable sequence
> since a single lockless RMW instruction can do this (this_cpu_inc f.e.)

Indeed, on x86, for the specific case of counter increment, the
single-instruction "add" or "inc" with a segment-selector prefix can
save setting up the rseq (a pointer store), and offsetting from a base
using the cpu number. If your wish is to do this at kernel level, where
we have full control over the gs segment, this makes sense. I'm worried
that applying this to user-space might create conflicts wrt who owns
that segment selector register wrt pre-existing applications.

>> One benefit of your proposal is to lessen the number of retired instructions,
>> but if we take the IPC into account, it is slower than rseq in my benchmark.
>> What benefits do you expect from using segment selectors and
>> non-lock-prefixed atomic instructions on the fast-path?
>
> Ultimately I wish fast increments like done by this_cpu_inc() could be
> implemented in an efficient way on non x86 platforms that do not have
> cheap instructions like that.

My understanding is that your focus is mainly on kernel code, right?
Or is your aim to port this_cpu_inc() to userspace as well?

Indeed, the concepts behind rseq could be ported to kernel code
eventually. The immediate gain is much higher by exposing this to
user-space though, given that there is no good way to perform per-cpu
operations efficiently at all there, whereas kernel code can always
disable preemption.

> If cmpxchg local is slower than a group of instructions to do the same
> then there is an obvious question to the cpu architects why we would need
> the instruction at all (aside from the fact that we do not need a
> restartable sequence for these instructions).

I'm not a specialist in CPU instruction scheduling, so I won't
speculate on this topic. ;-)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
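The remark that arming a sequence costs "a single store to a TLS" can be pictured with the following C sketch. It is a model under stated assumptions, not the layout from the patch: `model_rseq_cs`, `model_rseq`, their field names and the two helper functions are all hypothetical, and no kernel is involved, so the stored pointer has no effect beyond being readable back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of one critical section (field names assumed). */
struct model_rseq_cs {
    uintptr_t start_ip;         /* first instruction of the sequence */
    uintptr_t post_commit_ip;   /* first instruction after the final store */
    uintptr_t abort_ip;         /* where execution resumes on abort */
};

/* Hypothetical per-thread area that the kernel would read and update. */
struct model_rseq {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    const struct model_rseq_cs *rseq_cs;  /* NULL when outside any sequence */
};

/* One instance per thread, reached through thread-local storage. */
static _Thread_local struct model_rseq model_rseq_area;

/* Arming a sequence: the single pointer store the message refers to. */
static inline void model_rseq_enter(const struct model_rseq_cs *cs)
{
    assert(cs != NULL);
    model_rseq_area.rseq_cs = cs;
}

/* In this model the caller clears the pointer when the sequence is done. */
static inline void model_rseq_leave(void)
{
    model_rseq_area.rseq_cs = NULL;
}
```

On x86-64 a store to a `_Thread_local` object is typically compiled to one fs-relative `mov`, which is the `fs:mov` step shown in the pseudo-code elsewhere in this thread.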
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:

> > I think the proper way to think about gs and fs on x86 is as base
> > registers. They are essentially values in registers added to the address
> > generated in an instruction. As such the approach is transferable to other
> > processor architectures. Many support base register and base register
> > relative processing. If a processor can do RMW instructions base register
> > relative then you have something similar.
>
> How would you do it on ARM32?

Actually you do not really need RMW instructions. The data is cpu specific
so within a restartable sequence you would have exclusive access right?

F.e. an increment would be

1. Load base register relative
2. add 1
3. Store base register relative

The main overhead would be the registration of the sequence.

The advantage on x86 is that you do not need a restartable sequence
since a single lockless RMW instruction can do this (this_cpu_inc f.e.)

> One benefit of your proposal is to lessen the number of retired instructions,
> but if we take the IPC into account, it is slower than rseq in my benchmark.
> What benefits do you expect from using segment selectors and
> non-lock-prefixed atomic instructions on the fast-path?

Ultimately I wish fast increments like done by this_cpu_inc() could be
implemented in an efficient way on non x86 platforms that do not have
cheap instructions like that.

If cmpxchg local is slower than a group of instructions to do the same
then there is an obvious question to the cpu architects why we would need
the instruction at all (aside from the fact that we do not need a
restartable sequence for these instructions).
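The three-step increment listed in this message maps onto C as below. This is a minimal sketch with hypothetical names (`model_percpu`, `model_base`, `model_inc`): the "base register" is modelled as a pointer to one CPU's row of a two-dimensional array, and the offset as an index into that row. It shows the addressing scheme only and provides no protection against preemption or interrupts.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4
#define MODEL_SLOTS   8     /* counters per CPU in this model */

/* One row of counters per CPU; row n plays the role of CPU n's area. */
static long model_percpu[MODEL_NR_CPUS][MODEL_SLOTS];

/* Stand-in for loading the base register with this CPU's area. */
static long *model_base(unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    return model_percpu[cpu];
}

/* Increment the counter at index 'off' relative to 'base'. */
static void model_inc(long *base, size_t off)
{
    assert(base != NULL && off < MODEL_SLOTS);
    long tmp = base[off];   /* 1. load base register relative */
    tmp += 1;               /* 2. add 1 */
    base[off] = tmp;        /* 3. store base register relative */
}
```

The same `model_inc()` call with a different base updates a different CPU's copy of the counter, which is the property the base-register argument relies on.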
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
----- On Dec 14, 2017, at 11:44 AM, Chris Lameter c...@linux.com wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
>
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the cpu id cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
>
> I think the proper way to think about gs and fs on x86 is as base
> registers. They are essentially values in registers added to the address
> generated in an instruction. As such the approach is transferable to other
> processor architectures. Many support base register and base register
> relative processing. If a processor can do RMW instructions base register
> relative then you have something similar.

How would you do it on ARM32?

> In a restartable sequence you could increase efficiency by avoiding full
> atomic instructions. This would be similar to the lockless RMW available
> on x86 then. And in that form it is portable.
>
> A context switch to another processor would mean that the value of the
> base register has changed and that we therefore are accessing another per
> cpu segment. Restarting the sequence will yield a correct result without
> any reloading of registers.

As a concrete example, let's try to apply your proposal on a common
use-case: a compare-and-store on user-space per-cpu data. With my rseq
proposal the fast-path pseudo-code boils down to:

  load TLS::cpu_id_start into reg_X
  add reg_X offset to base to find target v
  store pointer to TLS::rseq_cs
  compare reg_X against TLS::cpu_id
  jne abort
  cmp *v, value
  jne cmpfail
  store newval to *v

My benchmark on Intel x86-64 E5-2630 shows that it takes 1.9 ns/iteration
for a test-case incrementing a counter with this rseq compare-and-store
sequence.

Let's assume we can reserve the gs segment selector for use in user-space,
and that the per-cpu data layout allows using this segment selector as
offset. The compare-and-store use-case would require a "cmpxchg"
instruction with a gs segment selector.

With a single-threaded test-case which uses non-lock-prefixed cmpxchg in
a loop on an E5-2630, I get 2.8 ns/iteration (no per-cpu data involved,
done on a single global value).

One benefit of your proposal is to lessen the number of retired
instructions, but if we take the IPC into account, it is slower than
rseq in my benchmark. What benefits do you expect from using segment
selectors and non-lock-prefixed atomic instructions on the fast-path?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
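The benchmark source is not part of this message, so the following is only a guess at the general shape of such a measurement: a timing harness around a single-threaded compare-and-store loop on one global value. `model_cmpstore` and `model_bench` are hypothetical names, the loop is plain C (the compiler is free to pick the instructions and will not necessarily emit `cmpxchg`), and timing uses the standard `clock()` function, so results are not comparable with the 1.9 ns and 2.8 ns figures quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Plain C compare-and-store on a single value: no lock prefix, no rseq,
 * and therefore no protection against preemption or other threads.
 */
static int model_cmpstore(volatile intptr_t *v, intptr_t expect, intptr_t newv)
{
    if (*v != expect)
        return -1;      /* cmpfail */
    *v = newv;          /* final store */
    return 0;
}

/*
 * Increment *v 'iters' times through model_cmpstore() and return the
 * average CPU time per iteration in nanoseconds, as measured by clock().
 */
static double model_bench(volatile intptr_t *v, long iters)
{
    assert(v != NULL && iters > 0);
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++) {
        intptr_t cur = *v;
        (void)model_cmpstore(v, cur, cur + 1);
    }
    clock_t t1 = clock();
    return (double)(t1 - t0) * 1e9 / (double)CLOCKS_PER_SEC / (double)iters;
}
```

A caller would pass the address of a `volatile intptr_t` initialised to zero and a large iteration count, then check that the value equals the iteration count before trusting the reported time. The resolution of `clock()` is coarse, so short runs can report zero.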
Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)
On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:

> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the cpu id cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.

I think the proper way to think about gs and fs on x86 is as base
registers. They are essentially values in registers added to the address
generated in an instruction. As such the approach is transferable to other
processor architectures. Many support base register and base register
relative processing. If a processor can do RMW instructions base register
relative then you have something similar.

In a restartable sequence you could increase efficiency by avoiding full
atomic instructions. This would be similar to the lockless RMW available
on x86 then. And in that form it is portable.

A context switch to another processor would mean that the value of the
base register has changed and that we therefore are accessing another per
cpu segment. Restarting the sequence will yield a correct result without
any reloading of registers.