Re: rseq event_counter field

2017-10-30 Thread Mathieu Desnoyers
- On Oct 30, 2017, at 2:08 PM, Kyle Huey m...@kylehuey.com wrote:

> On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski  wrote:
>> Answering both emails here.
>>
>> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
>> event_counter break rr?  For that matter, how badly does rseq without
>> an event_counter break rr?
>>
>> (Linus, if you care, I'm proposing that rseq is probably fine for
>> 4.15, but that it should be merged without the event_counter.  If
>> there is a bona fide use for event_counter along with a benchmark
>> showing that it makes a meaningful difference, then event_counter can
>> be easily added later.)
>>
>> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
>>> My understanding is that one of the motivations for the event counter was to
>>> make it possible to express things like "increment a per-cpu value" in C.
>>> Without this abstraction it wasn't possible to ensure that the strict
>>> requirements of the rseq ABI were met (contiguous restartable block, single
>>> instruction for the final write). If you write your entire restartable
>>> sequence in assembly there's no need for the event counter. Comparing all
>>> input values as an alternative way of having non-assembly sequences seems
>>> like it might have a risk of ABA style race conditions.
>>>
>>
>> My opinion after looking at a few examples is that event_counter
>> sounds useful but that it's really more of a crutch that enables
>> mediocre userspace code.  With event_counter, you can do roughly this:
>>
>> struct rseq_state rseq = begin();
>>
>> int cpu = rseq_getcpu();
>> Calculate some stuff.  Be very very careful to to chase pointers,
>> because it's easy to do in a very very fragile way as Mathieu's
>> example below does.
>>
>> commit(, some pointer you found, some value to write, some
>> double-checks to check);
>>
>> With a real database, this style of programming works well.  You
>> probably have a progress guarantee, and the database gives you the
>> semantics you want.  With rseq, you don't get any of this.  If the C
>> code is too slow or does something (an accidental interaction with
>> userfaultfd, perhaps?), then you lose forward progress.  You also can
>> be a bit slow in the C code, in which case you'll get more conflicts
>> detected than actually exist.
>>
>> Without event_counter, you can use a different abstraction.  In your
>> percpu data, you have something like:
>>
>> struct rseq_lock lock;
>> struct freelist_entry *head;
>>
>> Then you can do:
>>
>> int cpu = rseq_getcpu();
>> struct my_data *data = ... cpu ...;
>> struct rseq_state rseq = rseq_acquire(data->lock);
>>
>> Calculate some stuff, being careful not to touch anything that isn't
>> protected by the rseq_lock you're using.
>>
>> commit(, cpu, some pointer you found, some value to write, some
>> double-checks to check);
>>
>> This variant is indeed a tiny bit slower *if you measure time by
>> number if instructions* -- there are some very clever optimizations
>> that save some instructions in the event_counter variant.  But, in the
>> cases where you actually have C code in here, I suspect that the
>> difference won't matter in real code.  And the counter-less variant
>> can have arbitrarily slow C code or even syscalls without being
>> spuriously aborted until the very short asm commit sequence starts.
>>
>> BUT, I think that the event_counter-free version is likely to be
>> faster in real code if the library is done well.  The version with
>> event_counter has to load the event_counter from the rseq cacheline
>> right at the beginning, and that value is used, so unless the CPU is
>> really quite aggressive avoid speculating through a branch that hasn't
>> gotten its dependencies into cache yet, you're likely to stall a bit
>> if the per-thread rseq cacheline isn't in L1.
>>
>> The variant without event_counter can avoid this problem as long as
>> the architecture gives a cheap way to find the CPU number.  On x86,
>> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
>> any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
>> feature that isn't terribly hard to add and would IMO be quite nifty.)
>>  If this is done, then you still have to do:
>>
>> __this_thread_rseq->rseq_cs = _rseq_cs_here;
>>
>> but that's a *store* to the rseq cacheline.  It'll go in the store
>> buffer and, in the fast case, nothing ever depends on it.  Hence it's
>> unlikely to stall.  Of course, there's still a load from the
>> rseq_lock, but that shares a cacheline with the data that it protects,
>> so it's basically free.
>>
>> IOW, I think that adding event_counter encourages user code that has a
>> weaker progress guarantee, is slower in a very highly optimized
>> implementation, and allows users to think less about they're doing to
>> their own detriment, for the gain of saving a couple of instructions.
>> I could be wrong, but I think it would be nice to see evidence that
>> I'm wrong before 

Re: rseq event_counter field

2017-10-30 Thread Mathieu Desnoyers
- On Oct 30, 2017, at 2:08 PM, Kyle Huey m...@kylehuey.com wrote:

> On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski  wrote:
>> Answering both emails here.
>>
>> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
>> event_counter break rr?  For that matter, how badly does rseq without
>> an event_counter break rr?
>>
>> (Linus, if you care, I'm proposing that rseq is probably fine for
>> 4.15, but that it should be merged without the event_counter.  If
>> there is a bona fide use for event_counter along with a benchmark
>> showing that it makes a meaningful difference, then event_counter can
>> be easily added later.)
>>
>> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
>>> My understanding is that one of the motivations for the event counter was to
>>> make it possible to express things like "increment a per-cpu value" in C.
>>> Without this abstraction it wasn't possible to ensure that the strict
>>> requirements of the rseq ABI were met (contiguous restartable block, single
>>> instruction for the final write). If you write your entire restartable
>>> sequence in assembly there's no need for the event counter. Comparing all
>>> input values as an alternative way of having non-assembly sequences seems
>>> like it might have a risk of ABA style race conditions.
>>>
>>
>> My opinion after looking at a few examples is that event_counter
>> sounds useful but that it's really more of a crutch that enables
>> mediocre userspace code.  With event_counter, you can do roughly this:
>>
>> struct rseq_state rseq = begin();
>>
>> int cpu = rseq_getcpu();
>> Calculate some stuff.  Be very very careful to to chase pointers,
>> because it's easy to do in a very very fragile way as Mathieu's
>> example below does.
>>
>> commit(, some pointer you found, some value to write, some
>> double-checks to check);
>>
>> With a real database, this style of programming works well.  You
>> probably have a progress guarantee, and the database gives you the
>> semantics you want.  With rseq, you don't get any of this.  If the C
>> code is too slow or does something (an accidental interaction with
>> userfaultfd, perhaps?), then you lose forward progress.  You also can
>> be a bit slow in the C code, in which case you'll get more conflicts
>> detected than actually exist.
>>
>> Without event_counter, you can use a different abstraction.  In your
>> percpu data, you have something like:
>>
>> struct rseq_lock lock;
>> struct freelist_entry *head;
>>
>> Then you can do:
>>
>> int cpu = rseq_getcpu();
>> struct my_data *data = ... cpu ...;
>> struct rseq_state rseq = rseq_acquire(data->lock);
>>
>> Calculate some stuff, being careful not to touch anything that isn't
>> protected by the rseq_lock you're using.
>>
>> commit(, cpu, some pointer you found, some value to write, some
>> double-checks to check);
>>
>> This variant is indeed a tiny bit slower *if you measure time by
>> number if instructions* -- there are some very clever optimizations
>> that save some instructions in the event_counter variant.  But, in the
>> cases where you actually have C code in here, I suspect that the
>> difference won't matter in real code.  And the counter-less variant
>> can have arbitrarily slow C code or even syscalls without being
>> spuriously aborted until the very short asm commit sequence starts.
>>
>> BUT, I think that the event_counter-free version is likely to be
>> faster in real code if the library is done well.  The version with
>> event_counter has to load the event_counter from the rseq cacheline
>> right at the beginning, and that value is used, so unless the CPU is
>> really quite aggressive avoid speculating through a branch that hasn't
>> gotten its dependencies into cache yet, you're likely to stall a bit
>> if the per-thread rseq cacheline isn't in L1.
>>
>> The variant without event_counter can avoid this problem as long as
>> the architecture gives a cheap way to find the CPU number.  On x86,
>> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
>> any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
>> feature that isn't terribly hard to add and would IMO be quite nifty.)
>>  If this is done, then you still have to do:
>>
>> __this_thread_rseq->rseq_cs = _rseq_cs_here;
>>
>> but that's a *store* to the rseq cacheline.  It'll go in the store
>> buffer and, in the fast case, nothing ever depends on it.  Hence it's
>> unlikely to stall.  Of course, there's still a load from the
>> rseq_lock, but that shares a cacheline with the data that it protects,
>> so it's basically free.
>>
>> IOW, I think that adding event_counter encourages user code that has a
>> weaker progress guarantee, is slower in a very highly optimized
>> implementation, and allows users to think less about they're doing to
>> their own detriment, for the gain of saving a couple of instructions.
>> I could be wrong, but I think it would be nice to see evidence that
>> I'm wrong before event_counter becomes enshrined 

Re: rseq event_counter field

2017-10-30 Thread Kyle Huey
On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski  wrote:
> Answering both emails here.
>
> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
> event_counter break rr?  For that matter, how badly does rseq without
> an event_counter break rr?
>
> (Linus, if you care, I'm proposing that rseq is probably fine for
> 4.15, but that it should be merged without the event_counter.  If
> there is a bona fide use for event_counter along with a benchmark
> showing that it makes a meaningful difference, then event_counter can
> be easily added later.)
>
> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
>> My understanding is that one of the motivations for the event counter was to
>> make it possible to express things like "increment a per-cpu value" in C.
>> Without this abstraction it wasn't possible to ensure that the strict
>> requirements of the rseq ABI were met (contiguous restartable block, single
>> instruction for the final write). If you write your entire restartable
>> sequence in assembly there's no need for the event counter. Comparing all
>> input values as an alternative way of having non-assembly sequences seems
>> like it might have a risk of ABA style race conditions.
>>
>
> My opinion after looking at a few examples is that event_counter
> sounds useful but that it's really more of a crutch that enables
> mediocre userspace code.  With event_counter, you can do roughly this:
>
> struct rseq_state rseq = begin();
>
> int cpu = rseq_getcpu();
> Calculate some stuff.  Be very very careful to to chase pointers,
> because it's easy to do in a very very fragile way as Mathieu's
> example below does.
>
> commit(, some pointer you found, some value to write, some
> double-checks to check);
>
> With a real database, this style of programming works well.  You
> probably have a progress guarantee, and the database gives you the
> semantics you want.  With rseq, you don't get any of this.  If the C
> code is too slow or does something (an accidental interaction with
> userfaultfd, perhaps?), then you lose forward progress.  You also can
> be a bit slow in the C code, in which case you'll get more conflicts
> detected than actually exist.
>
> Without event_counter, you can use a different abstraction.  In your
> percpu data, you have something like:
>
> struct rseq_lock lock;
> struct freelist_entry *head;
>
> Then you can do:
>
> int cpu = rseq_getcpu();
> struct my_data *data = ... cpu ...;
> struct rseq_state rseq = rseq_acquire(data->lock);
>
> Calculate some stuff, being careful not to touch anything that isn't
> protected by the rseq_lock you're using.
>
> commit(, cpu, some pointer you found, some value to write, some
> double-checks to check);
>
> This variant is indeed a tiny bit slower *if you measure time by
> number if instructions* -- there are some very clever optimizations
> that save some instructions in the event_counter variant.  But, in the
> cases where you actually have C code in here, I suspect that the
> difference won't matter in real code.  And the counter-less variant
> can have arbitrarily slow C code or even syscalls without being
> spuriously aborted until the very short asm commit sequence starts.
>
> BUT, I think that the event_counter-free version is likely to be
> faster in real code if the library is done well.  The version with
> event_counter has to load the event_counter from the rseq cacheline
> right at the beginning, and that value is used, so unless the CPU is
> really quite aggressive avoid speculating through a branch that hasn't
> gotten its dependencies into cache yet, you're likely to stall a bit
> if the per-thread rseq cacheline isn't in L1.
>
> The variant without event_counter can avoid this problem as long as
> the architecture gives a cheap way to find the CPU number.  On x86,
> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
> any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
> feature that isn't terribly hard to add and would IMO be quite nifty.)
>  If this is done, then you still have to do:
>
> __this_thread_rseq->rseq_cs = _rseq_cs_here;
>
> but that's a *store* to the rseq cacheline.  It'll go in the store
> buffer and, in the fast case, nothing ever depends on it.  Hence it's
> unlikely to stall.  Of course, there's still a load from the
> rseq_lock, but that shares a cacheline with the data that it protects,
> so it's basically free.
>
> IOW, I think that adding event_counter encourages user code that has a
> weaker progress guarantee, is slower in a very highly optimized
> implementation, and allows users to think less about they're doing to
> their own detriment, for the gain of saving a couple of instructions.
> I could be wrong, but I think it would be nice to see evidence that
> I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
> suspect that neat tools like rr will have a much easier time dealing
> with rseq if event_counter isn't involved.
>

Re: rseq event_counter field

2017-10-30 Thread Kyle Huey
On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski  wrote:
> Answering both emails here.
>
> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
> event_counter break rr?  For that matter, how badly does rseq without
> an event_counter break rr?
>
> (Linus, if you care, I'm proposing that rseq is probably fine for
> 4.15, but that it should be merged without the event_counter.  If
> there is a bona fide use for event_counter along with a benchmark
> showing that it makes a meaningful difference, then event_counter can
> be easily added later.)
>
> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
>> My understanding is that one of the motivations for the event counter was to
>> make it possible to express things like "increment a per-cpu value" in C.
>> Without this abstraction it wasn't possible to ensure that the strict
>> requirements of the rseq ABI were met (contiguous restartable block, single
>> instruction for the final write). If you write your entire restartable
>> sequence in assembly there's no need for the event counter. Comparing all
>> input values as an alternative way of having non-assembly sequences seems
>> like it might have a risk of ABA style race conditions.
>>
>
> My opinion after looking at a few examples is that event_counter
> sounds useful but that it's really more of a crutch that enables
> mediocre userspace code.  With event_counter, you can do roughly this:
>
> struct rseq_state rseq = begin();
>
> int cpu = rseq_getcpu();
> Calculate some stuff.  Be very very careful to to chase pointers,
> because it's easy to do in a very very fragile way as Mathieu's
> example below does.
>
> commit(, some pointer you found, some value to write, some
> double-checks to check);
>
> With a real database, this style of programming works well.  You
> probably have a progress guarantee, and the database gives you the
> semantics you want.  With rseq, you don't get any of this.  If the C
> code is too slow or does something (an accidental interaction with
> userfaultfd, perhaps?), then you lose forward progress.  You also can
> be a bit slow in the C code, in which case you'll get more conflicts
> detected than actually exist.
>
> Without event_counter, you can use a different abstraction.  In your
> percpu data, you have something like:
>
> struct rseq_lock lock;
> struct freelist_entry *head;
>
> Then you can do:
>
> int cpu = rseq_getcpu();
> struct my_data *data = ... cpu ...;
> struct rseq_state rseq = rseq_acquire(data->lock);
>
> Calculate some stuff, being careful not to touch anything that isn't
> protected by the rseq_lock you're using.
>
> commit(, cpu, some pointer you found, some value to write, some
> double-checks to check);
>
> This variant is indeed a tiny bit slower *if you measure time by
> number if instructions* -- there are some very clever optimizations
> that save some instructions in the event_counter variant.  But, in the
> cases where you actually have C code in here, I suspect that the
> difference won't matter in real code.  And the counter-less variant
> can have arbitrarily slow C code or even syscalls without being
> spuriously aborted until the very short asm commit sequence starts.
>
> BUT, I think that the event_counter-free version is likely to be
> faster in real code if the library is done well.  The version with
> event_counter has to load the event_counter from the rseq cacheline
> right at the beginning, and that value is used, so unless the CPU is
> really quite aggressive avoid speculating through a branch that hasn't
> gotten its dependencies into cache yet, you're likely to stall a bit
> if the per-thread rseq cacheline isn't in L1.
>
> The variant without event_counter can avoid this problem as long as
> the architecture gives a cheap way to find the CPU number.  On x86,
> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
> any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
> feature that isn't terribly hard to add and would IMO be quite nifty.)
>  If this is done, then you still have to do:
>
> __this_thread_rseq->rseq_cs = _rseq_cs_here;
>
> but that's a *store* to the rseq cacheline.  It'll go in the store
> buffer and, in the fast case, nothing ever depends on it.  Hence it's
> unlikely to stall.  Of course, there's still a load from the
> rseq_lock, but that shares a cacheline with the data that it protects,
> so it's basically free.
>
> IOW, I think that adding event_counter encourages user code that has a
> weaker progress guarantee, is slower in a very highly optimized
> implementation, and allows users to think less about they're doing to
> their own detriment, for the gain of saving a couple of instructions.
> I could be wrong, but I think it would be nice to see evidence that
> I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
> suspect that neat tools like rr will have a much easier time dealing
> with rseq if event_counter isn't involved.
>
>>
>> RDPID is interesting, but it 

Re: rseq event_counter field

2017-10-28 Thread Mathieu Desnoyers
- On Oct 28, 2017, at 10:20 PM, Andy Lutomirski l...@kernel.org wrote:

> Answering both emails here.
> 
> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
> event_counter break rr?  For that matter, how badly does rseq without
> an event_counter break rr?
> 
> (Linus, if you care, I'm proposing that rseq is probably fine for
> 4.15, but that it should be merged without the event_counter.  If
> there is a bona fide use for event_counter along with a benchmark
> showing that it makes a meaningful difference, then event_counter can
> be easily added later.)
> 
> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
>> My understanding is that one of the motivations for the event counter was to
>> make it possible to express things like "increment a per-cpu value" in C.
>> Without this abstraction it wasn't possible to ensure that the strict
>> requirements of the rseq ABI were met (contiguous restartable block, single
>> instruction for the final write). If you write your entire restartable
>> sequence in assembly there's no need for the event counter. Comparing all
>> input values as an alternative way of having non-assembly sequences seems
>> like it might have a risk of ABA style race conditions.
>>
> 
> My opinion after looking at a few examples is that event_counter
> sounds useful but that it's really more of a crutch that enables
> mediocre userspace code.  With event_counter, you can do roughly this:
> 
> struct rseq_state rseq = begin();
> 
> int cpu = rseq_getcpu();
> Calculate some stuff.  Be very very careful to to chase pointers,
> because it's easy to do in a very very fragile way as Mathieu's
> example below does.
> 
> commit(, some pointer you found, some value to write, some
> double-checks to check);
> 
> With a real database, this style of programming works well.  You
> probably have a progress guarantee, and the database gives you the
> semantics you want.  With rseq, you don't get any of this.  If the C
> code is too slow or does something (an accidental interaction with
> userfaultfd, perhaps?), then you lose forward progress.  You also can
> be a bit slow in the C code, in which case you'll get more conflicts
> detected than actually exist.
> 
> Without event_counter, you can use a different abstraction.  In your
> percpu data, you have something like:
> 
> struct rseq_lock lock;
> struct freelist_entry *head;
> 
> Then you can do:
> 
> int cpu = rseq_getcpu();
> struct my_data *data = ... cpu ...;
> struct rseq_state rseq = rseq_acquire(data->lock);
> 
> Calculate some stuff, being careful not to touch anything that isn't
> protected by the rseq_lock you're using.
> 
> commit(, cpu, some pointer you found, some value to write, some
> double-checks to check);
> 
> This variant is indeed a tiny bit slower *if you measure time by
> number if instructions* -- there are some very clever optimizations
> that save some instructions in the event_counter variant.  But, in the
> cases where you actually have C code in here, I suspect that the
> difference won't matter in real code.  And the counter-less variant
> can have arbitrarily slow C code or even syscalls without being
> spuriously aborted until the very short asm commit sequence starts.
> 
> BUT, I think that the event_counter-free version is likely to be
> faster in real code if the library is done well.  The version with
> event_counter has to load the event_counter from the rseq cacheline
> right at the beginning, and that value is used, so unless the CPU is
> really quite aggressive avoid speculating through a branch that hasn't
> gotten its dependencies into cache yet, you're likely to stall a bit
> if the per-thread rseq cacheline isn't in L1.
> 
> The variant without event_counter can avoid this problem as long as
> the architecture gives a cheap way to find the CPU number.  On x86,
> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
> any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
> feature that isn't terribly hard to add and would IMO be quite nifty.)
> If this is done, then you still have to do:
> 
> __this_thread_rseq->rseq_cs = _rseq_cs_here;
> 
> but that's a *store* to the rseq cacheline.  It'll go in the store
> buffer and, in the fast case, nothing ever depends on it.  Hence it's
> unlikely to stall.  Of course, there's still a load from the
> rseq_lock, but that shares a cacheline with the data that it protects,
> so it's basically free.
> 
> IOW, I think that adding event_counter encourages user code that has a
> weaker progress guarantee, is slower in a very highly optimized
> implementation, and allows users to think less about they're doing to
> their own detriment, for the gain of saving a couple of instructions.
> I could be wrong, but I think it would be nice to see evidence that
> I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
> suspect that neat tools like rr will have a much easier time dealing
> with rseq if 

Re: rseq event_counter field

2017-10-28 Thread Mathieu Desnoyers
- On Oct 28, 2017, at 10:20 PM, Andy Lutomirski l...@kernel.org wrote:

> Answering both emails here.
> 
> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
> event_counter break rr?  For that matter, how badly does rseq without
> an event_counter break rr?
> 
> (Linus, if you care, I'm proposing that rseq is probably fine for
> 4.15, but that it should be merged without the event_counter.  If
> there is a bona fide use for event_counter along with a benchmark
> showing that it makes a meaningful difference, then event_counter can
> be easily added later.)
> 
> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
>> My understanding is that one of the motivations for the event counter was to
>> make it possible to express things like "increment a per-cpu value" in C.
>> Without this abstraction it wasn't possible to ensure that the strict
>> requirements of the rseq ABI were met (contiguous restartable block, single
>> instruction for the final write). If you write your entire restartable
>> sequence in assembly there's no need for the event counter. Comparing all
>> input values as an alternative way of having non-assembly sequences seems
>> like it might have a risk of ABA style race conditions.
>>
> 
> My opinion after looking at a few examples is that event_counter
> sounds useful but that it's really more of a crutch that enables
> mediocre userspace code.  With event_counter, you can do roughly this:
> 
> struct rseq_state rseq = begin();
> 
> int cpu = rseq_getcpu();
> Calculate some stuff.  Be very very careful to to chase pointers,
> because it's easy to do in a very very fragile way as Mathieu's
> example below does.
> 
> commit(, some pointer you found, some value to write, some
> double-checks to check);
> 
> With a real database, this style of programming works well.  You
> probably have a progress guarantee, and the database gives you the
> semantics you want.  With rseq, you don't get any of this.  If the C
> code is too slow or does something (an accidental interaction with
> userfaultfd, perhaps?), then you lose forward progress.  You also can
> be a bit slow in the C code, in which case you'll get more conflicts
> detected than actually exist.
> 
> Without event_counter, you can use a different abstraction.  In your
> percpu data, you have something like:
> 
> struct rseq_lock lock;
> struct freelist_entry *head;
> 
> Then you can do:
> 
> int cpu = rseq_getcpu();
> struct my_data *data = ... cpu ...;
> struct rseq_state rseq = rseq_acquire(data->lock);
> 
> Calculate some stuff, being careful not to touch anything that isn't
> protected by the rseq_lock you're using.
> 
> commit(, cpu, some pointer you found, some value to write, some
> double-checks to check);
> 
> This variant is indeed a tiny bit slower *if you measure time by
> number if instructions* -- there are some very clever optimizations
> that save some instructions in the event_counter variant.  But, in the
> cases where you actually have C code in here, I suspect that the
> difference won't matter in real code.  And the counter-less variant
> can have arbitrarily slow C code or even syscalls without being
> spuriously aborted until the very short asm commit sequence starts.
> 
> BUT, I think that the event_counter-free version is likely to be
> faster in real code if the library is done well.  The version with
> event_counter has to load the event_counter from the rseq cacheline
> right at the beginning, and that value is used, so unless the CPU is
> really quite aggressive avoid speculating through a branch that hasn't
> gotten its dependencies into cache yet, you're likely to stall a bit
> if the per-thread rseq cacheline isn't in L1.
> 
> The variant without event_counter can avoid this problem as long as
> the architecture gives a cheap way to find the CPU number.  On x86,
> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
> any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
> feature that isn't terribly hard to add and would IMO be quite nifty.)
> If this is done, then you still have to do:
> 
> __this_thread_rseq->rseq_cs = _rseq_cs_here;
> 
> but that's a *store* to the rseq cacheline.  It'll go in the store
> buffer and, in the fast case, nothing ever depends on it.  Hence it's
> unlikely to stall.  Of course, there's still a load from the
> rseq_lock, but that shares a cacheline with the data that it protects,
> so it's basically free.
> 
> IOW, I think that adding event_counter encourages user code that has a
> weaker progress guarantee, is slower in a very highly optimized
> implementation, and allows users to think less about they're doing to
> their own detriment, for the gain of saving a couple of instructions.
> I could be wrong, but I think it would be nice to see evidence that
> I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
> suspect that neat tools like rr will have a much easier time dealing
> with rseq if event_counter isn't 

Re: rseq event_counter field

2017-10-28 Thread Andy Lutomirski
Answering both emails here.

Also, welcome Kyle.  Kyle, how badly does rseq's proposed
event_counter break rr?  For that matter, how badly does rseq without
an event_counter break rr?

(Linus, if you care, I'm proposing that rseq is probably fine for
4.15, but that it should be merged without the event_counter.  If
there is a bona fide use for event_counter along with a benchmark
showing that it makes a meaningful difference, then event_counter can
be easily added later.)

On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
> My understanding is that one of the motivations for the event counter was to
> make it possible to express things like "increment a per-cpu value" in C.
> Without this abstraction it wasn't possible to ensure that the strict
> requirements of the rseq ABI were met (contiguous restartable block, single
> instruction for the final write). If you write your entire restartable
> sequence in assembly there's no need for the event counter. Comparing all
> input values as an alternative way of having non-assembly sequences seems
> like it might have a risk of ABA style race conditions.
>

My opinion after looking at a few examples is that event_counter
sounds useful but that it's really more of a crutch that enables
mediocre userspace code.  With event_counter, you can do roughly this:

struct rseq_state rseq = begin();

int cpu = rseq_getcpu();
Calculate some stuff.  Be very very careful to to chase pointers,
because it's easy to do in a very very fragile way as Mathieu's
example below does.

commit(, some pointer you found, some value to write, some
double-checks to check);

With a real database, this style of programming works well.  You
probably have a progress guarantee, and the database gives you the
semantics you want.  With rseq, you don't get any of this.  If the C
code is too slow or does something (an accidental interaction with
userfaultfd, perhaps?), then you lose forward progress.  You also can
be a bit slow in the C code, in which case you'll get more conflicts
detected than actually exist.

Without event_counter, you can use a different abstraction.  In your
percpu data, you have something like:

struct rseq_lock lock;
struct freelist_entry *head;

Then you can do:

int cpu = rseq_getcpu();
struct my_data *data = ... cpu ...;
struct rseq_state rseq = rseq_acquire(data->lock);

Calculate some stuff, being careful not to touch anything that isn't
protected by the rseq_lock you're using.

commit(, cpu, some pointer you found, some value to write, some
double-checks to check);

This variant is indeed a tiny bit slower *if you measure time by
number if instructions* -- there are some very clever optimizations
that save some instructions in the event_counter variant.  But, in the
cases where you actually have C code in here, I suspect that the
difference won't matter in real code.  And the counter-less variant
can have arbitrarily slow C code or even syscalls without being
spuriously aborted until the very short asm commit sequence starts.

BUT, I think that the event_counter-free version is likely to be
faster in real code if the library is done well.  The version with
event_counter has to load the event_counter from the rseq cacheline
right at the beginning, and that value is used, so unless the CPU is
really quite aggressive avoid speculating through a branch that hasn't
gotten its dependencies into cache yet, you're likely to stall a bit
if the per-thread rseq cacheline isn't in L1.

The variant without event_counter can avoid this problem as long as
the architecture gives a cheap way to find the CPU number.  On x86,
this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
feature that isn't terribly hard to add and would IMO be quite nifty.)
 If this is done, then you still have to do:

__this_thread_rseq->rseq_cs = _rseq_cs_here;

but that's a *store* to the rseq cacheline.  It'll go in the store
buffer and, in the fast case, nothing ever depends on it.  Hence it's
unlikely to stall.  Of course, there's still a load from the
rseq_lock, but that shares a cacheline with the data that it protects,
so it's basically free.

IOW, I think that adding event_counter encourages user code that has a
weaker progress guarantee, is slower in a very highly optimized
implementation, and allows users to think less about they're doing to
their own detriment, for the gain of saving a couple of instructions.
I could be wrong, but I think it would be nice to see evidence that
I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
suspect that neat tools like rr will have a much easier time dealing
with rseq if event_counter isn't involved.

>
> RDPID is interesting, but it seems like given the current ABI (which
> requires setting the current restartable range) you'd still need some sort
> of TLS value. Have any benchmarks of RDPID perf vs TLS?

That's the wrong comparison.  For a good 

Re: rseq event_counter field

2017-10-28 Thread Andy Lutomirski
Answering both emails here.

Also, welcome Kyle.  Kyle, how badly does rseq's proposed
event_counter break rr?  For that matter, how badly does rseq without
an event_counter break rr?

(Linus, if you care, I'm proposing that rseq is probably fine for
4.15, but that it should be merged without the event_counter.  If
there is a bona fide use for event_counter along with a benchmark
showing that it makes a meaningful difference, then event_counter can
be easily added later.)

On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer  wrote:
> My understanding is that one of the motivations for the event counter was to
> make it possible to express things like "increment a per-cpu value" in C.
> Without this abstraction it wasn't possible to ensure that the strict
> requirements of the rseq ABI were met (contiguous restartable block, single
> instruction for the final write). If you write your entire restartable
> sequence in assembly there's no need for the event counter. Comparing all
> input values as an alternative way of having non-assembly sequences seems
> like it might have a risk of ABA style race conditions.
>

My opinion after looking at a few examples is that event_counter
sounds useful but that it's really more of a crutch that enables
mediocre userspace code.  With event_counter, you can do roughly this:

struct rseq_state rseq = begin();

int cpu = rseq_getcpu();
Calculate some stuff.  Be very very careful to to chase pointers,
because it's easy to do in a very very fragile way as Mathieu's
example below does.

commit(, some pointer you found, some value to write, some
double-checks to check);

With a real database, this style of programming works well.  You
probably have a progress guarantee, and the database gives you the
semantics you want.  With rseq, you don't get any of this.  If the C
code is too slow or does something (an accidental interaction with
userfaultfd, perhaps?), then you lose forward progress.  You also can
be a bit slow in the C code, in which case you'll get more conflicts
detected than actually exist.

Without event_counter, you can use a different abstraction.  In your
percpu data, you have something like:

struct rseq_lock lock;
struct freelist_entry *head;

Then you can do:

int cpu = rseq_getcpu();
struct my_data *data = ... cpu ...;
struct rseq_state rseq = rseq_acquire(data->lock);

Calculate some stuff, being careful not to touch anything that isn't
protected by the rseq_lock you're using.

commit(, cpu, some pointer you found, some value to write, some
double-checks to check);

This variant is indeed a tiny bit slower *if you measure time by
number if instructions* -- there are some very clever optimizations
that save some instructions in the event_counter variant.  But, in the
cases where you actually have C code in here, I suspect that the
difference won't matter in real code.  And the counter-less variant
can have arbitrarily slow C code or even syscalls without being
spuriously aborted until the very short asm commit sequence starts.

BUT, I think that the event_counter-free version is likely to be
faster in real code if the library is done well.  The version with
event_counter has to load the event_counter from the rseq cacheline
right at the beginning, and that value is used, so unless the CPU is
really quite aggressive avoid speculating through a branch that hasn't
gotten its dependencies into cache yet, you're likely to stall a bit
if the per-thread rseq cacheline isn't in L1.

The variant without event_counter can avoid this problem as long as
the architecture gives a cheap way to find the CPU number.  On x86,
this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
feature that isn't terribly hard to add and would IMO be quite nifty.)
 If this is done, then you still have to do:

__this_thread_rseq->rseq_cs = _rseq_cs_here;

but that's a *store* to the rseq cacheline.  It'll go in the store
buffer and, in the fast case, nothing ever depends on it.  Hence it's
unlikely to stall.  Of course, there's still a load from the
rseq_lock, but that shares a cacheline with the data that it protects,
so it's basically free.

IOW, I think that adding event_counter encourages user code that has a
weaker progress guarantee, is slower in a very highly optimized
implementation, and allows users to think less about they're doing to
their own detriment, for the gain of saving a couple of instructions.
I could be wrong, but I think it would be nice to see evidence that
I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
suspect that neat tools like rr will have a much easier time dealing
with rseq if event_counter isn't involved.

>
> RDPID is interesting, but it seems like given the current ABI (which
> requires setting the current restartable range) you'd still need some sort
> of TLS value. Have any benchmarks of RDPID perf vs TLS?

That's the wrong comparison.  For a good userspace