Re: rseq event_counter field
- On Oct 30, 2017, at 2:08 PM, Kyle Huey m...@kylehuey.com wrote: > On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirskiwrote: >> Answering both emails here. >> >> Also, welcome Kyle. Kyle, how badly does rseq's proposed >> event_counter break rr? For that matter, how badly does rseq without >> an event_counter break rr? >> >> (Linus, if you care, I'm proposing that rseq is probably fine for >> 4.15, but that it should be merged without the event_counter. If >> there is a bona fide use for event_counter along with a benchmark >> showing that it makes a meaningful difference, then event_counter can >> be easily added later.) >> >> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer wrote: >>> My understanding is that one of the motivations for the event counter was to >>> make it possible to express things like "increment a per-cpu value" in C. >>> Without this abstraction it wasn't possible to ensure that the strict >>> requirements of the rseq ABI were met (contiguous restartable block, single >>> instruction for the final write). If you write your entire restartable >>> sequence in assembly there's no need for the event counter. Comparing all >>> input values as an alternative way of having non-assembly sequences seems >>> like it might have a risk of ABA style race conditions. >>> >> >> My opinion after looking at a few examples is that event_counter >> sounds useful but that it's really more of a crutch that enables >> mediocre userspace code. With event_counter, you can do roughly this: >> >> struct rseq_state rseq = begin(); >> >> int cpu = rseq_getcpu(); >> Calculate some stuff. Be very very careful to to chase pointers, >> because it's easy to do in a very very fragile way as Mathieu's >> example below does. >> >> commit(, some pointer you found, some value to write, some >> double-checks to check); >> >> With a real database, this style of programming works well. You >> probably have a progress guarantee, and the database gives you the >> semantics you want. With rseq, you don't get any of this. If the C >> code is too slow or does something (an accidental interaction with >> userfaultfd, perhaps?), then you lose forward progress. You also can >> be a bit slow in the C code, in which case you'll get more conflicts >> detected than actually exist. >> >> Without event_counter, you can use a different abstraction. In your >> percpu data, you have something like: >> >> struct rseq_lock lock; >> struct freelist_entry *head; >> >> Then you can do: >> >> int cpu = rseq_getcpu(); >> struct my_data *data = ... cpu ...; >> struct rseq_state rseq = rseq_acquire(data->lock); >> >> Calculate some stuff, being careful not to touch anything that isn't >> protected by the rseq_lock you're using. >> >> commit(, cpu, some pointer you found, some value to write, some >> double-checks to check); >> >> This variant is indeed a tiny bit slower *if you measure time by >> number if instructions* -- there are some very clever optimizations >> that save some instructions in the event_counter variant. But, in the >> cases where you actually have C code in here, I suspect that the >> difference won't matter in real code. And the counter-less variant >> can have arbitrarily slow C code or even syscalls without being >> spuriously aborted until the very short asm commit sequence starts. >> >> BUT, I think that the event_counter-free version is likely to be >> faster in real code if the library is done well. The version with >> event_counter has to load the event_counter from the rseq cacheline >> right at the beginning, and that value is used, so unless the CPU is >> really quite aggressive avoid speculating through a branch that hasn't >> gotten its dependencies into cache yet, you're likely to stall a bit >> if the per-thread rseq cacheline isn't in L1. >> >> The variant without event_counter can avoid this problem as long as >> the architecture gives a cheap way to find the CPU number. On x86, >> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on >> any CPU as long as the kernel supports per-cpu FSBASE. (That's a >> feature that isn't terribly hard to add and would IMO be quite nifty.) >> If this is done, then you still have to do: >> >> __this_thread_rseq->rseq_cs = _rseq_cs_here; >> >> but that's a *store* to the rseq cacheline. It'll go in the store >> buffer and, in the fast case, nothing ever depends on it. Hence it's >> unlikely to stall. Of course, there's still a load from the >> rseq_lock, but that shares a cacheline with the data that it protects, >> so it's basically free. >> >> IOW, I think that adding event_counter encourages user code that has a >> weaker progress guarantee, is slower in a very highly optimized >> implementation, and allows users to think less about they're doing to >> their own detriment, for the gain of saving a couple of instructions. >> I could be wrong, but I think it would be nice to see evidence that >> I'm wrong before
Re: rseq event_counter field
- On Oct 30, 2017, at 2:08 PM, Kyle Huey m...@kylehuey.com wrote: > On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski wrote: >> Answering both emails here. >> >> Also, welcome Kyle. Kyle, how badly does rseq's proposed >> event_counter break rr? For that matter, how badly does rseq without >> an event_counter break rr? >> >> (Linus, if you care, I'm proposing that rseq is probably fine for >> 4.15, but that it should be merged without the event_counter. If >> there is a bona fide use for event_counter along with a benchmark >> showing that it makes a meaningful difference, then event_counter can >> be easily added later.) >> >> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer wrote: >>> My understanding is that one of the motivations for the event counter was to >>> make it possible to express things like "increment a per-cpu value" in C. >>> Without this abstraction it wasn't possible to ensure that the strict >>> requirements of the rseq ABI were met (contiguous restartable block, single >>> instruction for the final write). If you write your entire restartable >>> sequence in assembly there's no need for the event counter. Comparing all >>> input values as an alternative way of having non-assembly sequences seems >>> like it might have a risk of ABA style race conditions. >>> >> >> My opinion after looking at a few examples is that event_counter >> sounds useful but that it's really more of a crutch that enables >> mediocre userspace code. With event_counter, you can do roughly this: >> >> struct rseq_state rseq = begin(); >> >> int cpu = rseq_getcpu(); >> Calculate some stuff. Be very very careful to to chase pointers, >> because it's easy to do in a very very fragile way as Mathieu's >> example below does. >> >> commit(, some pointer you found, some value to write, some >> double-checks to check); >> >> With a real database, this style of programming works well. You >> probably have a progress guarantee, and the database gives you the >> semantics you want. With rseq, you don't get any of this. If the C >> code is too slow or does something (an accidental interaction with >> userfaultfd, perhaps?), then you lose forward progress. You also can >> be a bit slow in the C code, in which case you'll get more conflicts >> detected than actually exist. >> >> Without event_counter, you can use a different abstraction. In your >> percpu data, you have something like: >> >> struct rseq_lock lock; >> struct freelist_entry *head; >> >> Then you can do: >> >> int cpu = rseq_getcpu(); >> struct my_data *data = ... cpu ...; >> struct rseq_state rseq = rseq_acquire(data->lock); >> >> Calculate some stuff, being careful not to touch anything that isn't >> protected by the rseq_lock you're using. >> >> commit(, cpu, some pointer you found, some value to write, some >> double-checks to check); >> >> This variant is indeed a tiny bit slower *if you measure time by >> number if instructions* -- there are some very clever optimizations >> that save some instructions in the event_counter variant. But, in the >> cases where you actually have C code in here, I suspect that the >> difference won't matter in real code. And the counter-less variant >> can have arbitrarily slow C code or even syscalls without being >> spuriously aborted until the very short asm commit sequence starts. >> >> BUT, I think that the event_counter-free version is likely to be >> faster in real code if the library is done well. The version with >> event_counter has to load the event_counter from the rseq cacheline >> right at the beginning, and that value is used, so unless the CPU is >> really quite aggressive avoid speculating through a branch that hasn't >> gotten its dependencies into cache yet, you're likely to stall a bit >> if the per-thread rseq cacheline isn't in L1. >> >> The variant without event_counter can avoid this problem as long as >> the architecture gives a cheap way to find the CPU number. On x86, >> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on >> any CPU as long as the kernel supports per-cpu FSBASE. (That's a >> feature that isn't terribly hard to add and would IMO be quite nifty.) >> If this is done, then you still have to do: >> >> __this_thread_rseq->rseq_cs = _rseq_cs_here; >> >> but that's a *store* to the rseq cacheline. It'll go in the store >> buffer and, in the fast case, nothing ever depends on it. Hence it's >> unlikely to stall. Of course, there's still a load from the >> rseq_lock, but that shares a cacheline with the data that it protects, >> so it's basically free. >> >> IOW, I think that adding event_counter encourages user code that has a >> weaker progress guarantee, is slower in a very highly optimized >> implementation, and allows users to think less about they're doing to >> their own detriment, for the gain of saving a couple of instructions. >> I could be wrong, but I think it would be nice to see evidence that >> I'm wrong before event_counter becomes enshrined
Re: rseq event_counter field
On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirskiwrote: > Answering both emails here. > > Also, welcome Kyle. Kyle, how badly does rseq's proposed > event_counter break rr? For that matter, how badly does rseq without > an event_counter break rr? > > (Linus, if you care, I'm proposing that rseq is probably fine for > 4.15, but that it should be merged without the event_counter. If > there is a bona fide use for event_counter along with a benchmark > showing that it makes a meaningful difference, then event_counter can > be easily added later.) > > On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer wrote: >> My understanding is that one of the motivations for the event counter was to >> make it possible to express things like "increment a per-cpu value" in C. >> Without this abstraction it wasn't possible to ensure that the strict >> requirements of the rseq ABI were met (contiguous restartable block, single >> instruction for the final write). If you write your entire restartable >> sequence in assembly there's no need for the event counter. Comparing all >> input values as an alternative way of having non-assembly sequences seems >> like it might have a risk of ABA style race conditions. >> > > My opinion after looking at a few examples is that event_counter > sounds useful but that it's really more of a crutch that enables > mediocre userspace code. With event_counter, you can do roughly this: > > struct rseq_state rseq = begin(); > > int cpu = rseq_getcpu(); > Calculate some stuff. Be very very careful to to chase pointers, > because it's easy to do in a very very fragile way as Mathieu's > example below does. > > commit(, some pointer you found, some value to write, some > double-checks to check); > > With a real database, this style of programming works well. You > probably have a progress guarantee, and the database gives you the > semantics you want. With rseq, you don't get any of this. If the C > code is too slow or does something (an accidental interaction with > userfaultfd, perhaps?), then you lose forward progress. You also can > be a bit slow in the C code, in which case you'll get more conflicts > detected than actually exist. > > Without event_counter, you can use a different abstraction. In your > percpu data, you have something like: > > struct rseq_lock lock; > struct freelist_entry *head; > > Then you can do: > > int cpu = rseq_getcpu(); > struct my_data *data = ... cpu ...; > struct rseq_state rseq = rseq_acquire(data->lock); > > Calculate some stuff, being careful not to touch anything that isn't > protected by the rseq_lock you're using. > > commit(, cpu, some pointer you found, some value to write, some > double-checks to check); > > This variant is indeed a tiny bit slower *if you measure time by > number if instructions* -- there are some very clever optimizations > that save some instructions in the event_counter variant. But, in the > cases where you actually have C code in here, I suspect that the > difference won't matter in real code. And the counter-less variant > can have arbitrarily slow C code or even syscalls without being > spuriously aborted until the very short asm commit sequence starts. > > BUT, I think that the event_counter-free version is likely to be > faster in real code if the library is done well. The version with > event_counter has to load the event_counter from the rseq cacheline > right at the beginning, and that value is used, so unless the CPU is > really quite aggressive avoid speculating through a branch that hasn't > gotten its dependencies into cache yet, you're likely to stall a bit > if the per-thread rseq cacheline isn't in L1. > > The variant without event_counter can avoid this problem as long as > the architecture gives a cheap way to find the CPU number. On x86, > this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on > any CPU as long as the kernel supports per-cpu FSBASE. (That's a > feature that isn't terribly hard to add and would IMO be quite nifty.) > If this is done, then you still have to do: > > __this_thread_rseq->rseq_cs = _rseq_cs_here; > > but that's a *store* to the rseq cacheline. It'll go in the store > buffer and, in the fast case, nothing ever depends on it. Hence it's > unlikely to stall. Of course, there's still a load from the > rseq_lock, but that shares a cacheline with the data that it protects, > so it's basically free. > > IOW, I think that adding event_counter encourages user code that has a > weaker progress guarantee, is slower in a very highly optimized > implementation, and allows users to think less about they're doing to > their own detriment, for the gain of saving a couple of instructions. > I could be wrong, but I think it would be nice to see evidence that > I'm wrong before event_counter becomes enshrined in the ABI. Also, I > suspect that neat tools like rr will have a much easier time dealing > with rseq if event_counter isn't involved. >
Re: rseq event_counter field
On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski wrote: > Answering both emails here. > > Also, welcome Kyle. Kyle, how badly does rseq's proposed > event_counter break rr? For that matter, how badly does rseq without > an event_counter break rr? > > (Linus, if you care, I'm proposing that rseq is probably fine for > 4.15, but that it should be merged without the event_counter. If > there is a bona fide use for event_counter along with a benchmark > showing that it makes a meaningful difference, then event_counter can > be easily added later.) > > On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer wrote: >> My understanding is that one of the motivations for the event counter was to >> make it possible to express things like "increment a per-cpu value" in C. >> Without this abstraction it wasn't possible to ensure that the strict >> requirements of the rseq ABI were met (contiguous restartable block, single >> instruction for the final write). If you write your entire restartable >> sequence in assembly there's no need for the event counter. Comparing all >> input values as an alternative way of having non-assembly sequences seems >> like it might have a risk of ABA style race conditions. >> > > My opinion after looking at a few examples is that event_counter > sounds useful but that it's really more of a crutch that enables > mediocre userspace code. With event_counter, you can do roughly this: > > struct rseq_state rseq = begin(); > > int cpu = rseq_getcpu(); > Calculate some stuff. Be very very careful to to chase pointers, > because it's easy to do in a very very fragile way as Mathieu's > example below does. > > commit(, some pointer you found, some value to write, some > double-checks to check); > > With a real database, this style of programming works well. You > probably have a progress guarantee, and the database gives you the > semantics you want. With rseq, you don't get any of this. If the C > code is too slow or does something (an accidental interaction with > userfaultfd, perhaps?), then you lose forward progress. You also can > be a bit slow in the C code, in which case you'll get more conflicts > detected than actually exist. > > Without event_counter, you can use a different abstraction. In your > percpu data, you have something like: > > struct rseq_lock lock; > struct freelist_entry *head; > > Then you can do: > > int cpu = rseq_getcpu(); > struct my_data *data = ... cpu ...; > struct rseq_state rseq = rseq_acquire(data->lock); > > Calculate some stuff, being careful not to touch anything that isn't > protected by the rseq_lock you're using. > > commit(, cpu, some pointer you found, some value to write, some > double-checks to check); > > This variant is indeed a tiny bit slower *if you measure time by > number if instructions* -- there are some very clever optimizations > that save some instructions in the event_counter variant. But, in the > cases where you actually have C code in here, I suspect that the > difference won't matter in real code. And the counter-less variant > can have arbitrarily slow C code or even syscalls without being > spuriously aborted until the very short asm commit sequence starts. > > BUT, I think that the event_counter-free version is likely to be > faster in real code if the library is done well. The version with > event_counter has to load the event_counter from the rseq cacheline > right at the beginning, and that value is used, so unless the CPU is > really quite aggressive avoid speculating through a branch that hasn't > gotten its dependencies into cache yet, you're likely to stall a bit > if the per-thread rseq cacheline isn't in L1. > > The variant without event_counter can avoid this problem as long as > the architecture gives a cheap way to find the CPU number. On x86, > this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on > any CPU as long as the kernel supports per-cpu FSBASE. (That's a > feature that isn't terribly hard to add and would IMO be quite nifty.) > If this is done, then you still have to do: > > __this_thread_rseq->rseq_cs = _rseq_cs_here; > > but that's a *store* to the rseq cacheline. It'll go in the store > buffer and, in the fast case, nothing ever depends on it. Hence it's > unlikely to stall. Of course, there's still a load from the > rseq_lock, but that shares a cacheline with the data that it protects, > so it's basically free. > > IOW, I think that adding event_counter encourages user code that has a > weaker progress guarantee, is slower in a very highly optimized > implementation, and allows users to think less about they're doing to > their own detriment, for the gain of saving a couple of instructions. > I could be wrong, but I think it would be nice to see evidence that > I'm wrong before event_counter becomes enshrined in the ABI. Also, I > suspect that neat tools like rr will have a much easier time dealing > with rseq if event_counter isn't involved. > >> >> RDPID is interesting, but it
Re: rseq event_counter field
- On Oct 28, 2017, at 10:20 PM, Andy Lutomirski l...@kernel.org wrote: > Answering both emails here. > > Also, welcome Kyle. Kyle, how badly does rseq's proposed > event_counter break rr? For that matter, how badly does rseq without > an event_counter break rr? > > (Linus, if you care, I'm proposing that rseq is probably fine for > 4.15, but that it should be merged without the event_counter. If > there is a bona fide use for event_counter along with a benchmark > showing that it makes a meaningful difference, then event_counter can > be easily added later.) > > On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurerwrote: >> My understanding is that one of the motivations for the event counter was to >> make it possible to express things like "increment a per-cpu value" in C. >> Without this abstraction it wasn't possible to ensure that the strict >> requirements of the rseq ABI were met (contiguous restartable block, single >> instruction for the final write). If you write your entire restartable >> sequence in assembly there's no need for the event counter. Comparing all >> input values as an alternative way of having non-assembly sequences seems >> like it might have a risk of ABA style race conditions. >> > > My opinion after looking at a few examples is that event_counter > sounds useful but that it's really more of a crutch that enables > mediocre userspace code. With event_counter, you can do roughly this: > > struct rseq_state rseq = begin(); > > int cpu = rseq_getcpu(); > Calculate some stuff. Be very very careful to to chase pointers, > because it's easy to do in a very very fragile way as Mathieu's > example below does. > > commit(, some pointer you found, some value to write, some > double-checks to check); > > With a real database, this style of programming works well. You > probably have a progress guarantee, and the database gives you the > semantics you want. With rseq, you don't get any of this. If the C > code is too slow or does something (an accidental interaction with > userfaultfd, perhaps?), then you lose forward progress. You also can > be a bit slow in the C code, in which case you'll get more conflicts > detected than actually exist. > > Without event_counter, you can use a different abstraction. In your > percpu data, you have something like: > > struct rseq_lock lock; > struct freelist_entry *head; > > Then you can do: > > int cpu = rseq_getcpu(); > struct my_data *data = ... cpu ...; > struct rseq_state rseq = rseq_acquire(data->lock); > > Calculate some stuff, being careful not to touch anything that isn't > protected by the rseq_lock you're using. > > commit(, cpu, some pointer you found, some value to write, some > double-checks to check); > > This variant is indeed a tiny bit slower *if you measure time by > number if instructions* -- there are some very clever optimizations > that save some instructions in the event_counter variant. But, in the > cases where you actually have C code in here, I suspect that the > difference won't matter in real code. And the counter-less variant > can have arbitrarily slow C code or even syscalls without being > spuriously aborted until the very short asm commit sequence starts. > > BUT, I think that the event_counter-free version is likely to be > faster in real code if the library is done well. The version with > event_counter has to load the event_counter from the rseq cacheline > right at the beginning, and that value is used, so unless the CPU is > really quite aggressive avoid speculating through a branch that hasn't > gotten its dependencies into cache yet, you're likely to stall a bit > if the per-thread rseq cacheline isn't in L1. > > The variant without event_counter can avoid this problem as long as > the architecture gives a cheap way to find the CPU number. On x86, > this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on > any CPU as long as the kernel supports per-cpu FSBASE. (That's a > feature that isn't terribly hard to add and would IMO be quite nifty.) > If this is done, then you still have to do: > > __this_thread_rseq->rseq_cs = _rseq_cs_here; > > but that's a *store* to the rseq cacheline. It'll go in the store > buffer and, in the fast case, nothing ever depends on it. Hence it's > unlikely to stall. Of course, there's still a load from the > rseq_lock, but that shares a cacheline with the data that it protects, > so it's basically free. > > IOW, I think that adding event_counter encourages user code that has a > weaker progress guarantee, is slower in a very highly optimized > implementation, and allows users to think less about they're doing to > their own detriment, for the gain of saving a couple of instructions. > I could be wrong, but I think it would be nice to see evidence that > I'm wrong before event_counter becomes enshrined in the ABI. Also, I > suspect that neat tools like rr will have a much easier time dealing > with rseq if
Re: rseq event_counter field
- On Oct 28, 2017, at 10:20 PM, Andy Lutomirski l...@kernel.org wrote: > Answering both emails here. > > Also, welcome Kyle. Kyle, how badly does rseq's proposed > event_counter break rr? For that matter, how badly does rseq without > an event_counter break rr? > > (Linus, if you care, I'm proposing that rseq is probably fine for > 4.15, but that it should be merged without the event_counter. If > there is a bona fide use for event_counter along with a benchmark > showing that it makes a meaningful difference, then event_counter can > be easily added later.) > > On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer wrote: >> My understanding is that one of the motivations for the event counter was to >> make it possible to express things like "increment a per-cpu value" in C. >> Without this abstraction it wasn't possible to ensure that the strict >> requirements of the rseq ABI were met (contiguous restartable block, single >> instruction for the final write). If you write your entire restartable >> sequence in assembly there's no need for the event counter. Comparing all >> input values as an alternative way of having non-assembly sequences seems >> like it might have a risk of ABA style race conditions. >> > > My opinion after looking at a few examples is that event_counter > sounds useful but that it's really more of a crutch that enables > mediocre userspace code. With event_counter, you can do roughly this: > > struct rseq_state rseq = begin(); > > int cpu = rseq_getcpu(); > Calculate some stuff. Be very very careful to to chase pointers, > because it's easy to do in a very very fragile way as Mathieu's > example below does. > > commit(, some pointer you found, some value to write, some > double-checks to check); > > With a real database, this style of programming works well. You > probably have a progress guarantee, and the database gives you the > semantics you want. With rseq, you don't get any of this. If the C > code is too slow or does something (an accidental interaction with > userfaultfd, perhaps?), then you lose forward progress. You also can > be a bit slow in the C code, in which case you'll get more conflicts > detected than actually exist. > > Without event_counter, you can use a different abstraction. In your > percpu data, you have something like: > > struct rseq_lock lock; > struct freelist_entry *head; > > Then you can do: > > int cpu = rseq_getcpu(); > struct my_data *data = ... cpu ...; > struct rseq_state rseq = rseq_acquire(data->lock); > > Calculate some stuff, being careful not to touch anything that isn't > protected by the rseq_lock you're using. > > commit(, cpu, some pointer you found, some value to write, some > double-checks to check); > > This variant is indeed a tiny bit slower *if you measure time by > number if instructions* -- there are some very clever optimizations > that save some instructions in the event_counter variant. But, in the > cases where you actually have C code in here, I suspect that the > difference won't matter in real code. And the counter-less variant > can have arbitrarily slow C code or even syscalls without being > spuriously aborted until the very short asm commit sequence starts. > > BUT, I think that the event_counter-free version is likely to be > faster in real code if the library is done well. The version with > event_counter has to load the event_counter from the rseq cacheline > right at the beginning, and that value is used, so unless the CPU is > really quite aggressive avoid speculating through a branch that hasn't > gotten its dependencies into cache yet, you're likely to stall a bit > if the per-thread rseq cacheline isn't in L1. > > The variant without event_counter can avoid this problem as long as > the architecture gives a cheap way to find the CPU number. On x86, > this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on > any CPU as long as the kernel supports per-cpu FSBASE. (That's a > feature that isn't terribly hard to add and would IMO be quite nifty.) > If this is done, then you still have to do: > > __this_thread_rseq->rseq_cs = _rseq_cs_here; > > but that's a *store* to the rseq cacheline. It'll go in the store > buffer and, in the fast case, nothing ever depends on it. Hence it's > unlikely to stall. Of course, there's still a load from the > rseq_lock, but that shares a cacheline with the data that it protects, > so it's basically free. > > IOW, I think that adding event_counter encourages user code that has a > weaker progress guarantee, is slower in a very highly optimized > implementation, and allows users to think less about they're doing to > their own detriment, for the gain of saving a couple of instructions. > I could be wrong, but I think it would be nice to see evidence that > I'm wrong before event_counter becomes enshrined in the ABI. Also, I > suspect that neat tools like rr will have a much easier time dealing > with rseq if event_counter isn't
Re: rseq event_counter field
Answering both emails here. Also, welcome Kyle. Kyle, how badly does rseq's proposed event_counter break rr? For that matter, how badly does rseq without an event_counter break rr? (Linus, if you care, I'm proposing that rseq is probably fine for 4.15, but that it should be merged without the event_counter. If there is a bona fide use for event_counter along with a benchmark showing that it makes a meaningful difference, then event_counter can be easily added later.) On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurerwrote: > My understanding is that one of the motivations for the event counter was to > make it possible to express things like "increment a per-cpu value" in C. > Without this abstraction it wasn't possible to ensure that the strict > requirements of the rseq ABI were met (contiguous restartable block, single > instruction for the final write). If you write your entire restartable > sequence in assembly there's no need for the event counter. Comparing all > input values as an alternative way of having non-assembly sequences seems > like it might have a risk of ABA style race conditions. > My opinion after looking at a few examples is that event_counter sounds useful but that it's really more of a crutch that enables mediocre userspace code. With event_counter, you can do roughly this: struct rseq_state rseq = begin(); int cpu = rseq_getcpu(); Calculate some stuff. Be very very careful to to chase pointers, because it's easy to do in a very very fragile way as Mathieu's example below does. commit(, some pointer you found, some value to write, some double-checks to check); With a real database, this style of programming works well. You probably have a progress guarantee, and the database gives you the semantics you want. With rseq, you don't get any of this. If the C code is too slow or does something (an accidental interaction with userfaultfd, perhaps?), then you lose forward progress. You also can be a bit slow in the C code, in which case you'll get more conflicts detected than actually exist. Without event_counter, you can use a different abstraction. In your percpu data, you have something like: struct rseq_lock lock; struct freelist_entry *head; Then you can do: int cpu = rseq_getcpu(); struct my_data *data = ... cpu ...; struct rseq_state rseq = rseq_acquire(data->lock); Calculate some stuff, being careful not to touch anything that isn't protected by the rseq_lock you're using. commit(, cpu, some pointer you found, some value to write, some double-checks to check); This variant is indeed a tiny bit slower *if you measure time by number if instructions* -- there are some very clever optimizations that save some instructions in the event_counter variant. But, in the cases where you actually have C code in here, I suspect that the difference won't matter in real code. And the counter-less variant can have arbitrarily slow C code or even syscalls without being spuriously aborted until the very short asm commit sequence starts. BUT, I think that the event_counter-free version is likely to be faster in real code if the library is done well. The version with event_counter has to load the event_counter from the rseq cacheline right at the beginning, and that value is used, so unless the CPU is really quite aggressive avoid speculating through a branch that hasn't gotten its dependencies into cache yet, you're likely to stall a bit if the per-thread rseq cacheline isn't in L1. The variant without event_counter can avoid this problem as long as the architecture gives a cheap way to find the CPU number. On x86, this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on any CPU as long as the kernel supports per-cpu FSBASE. (That's a feature that isn't terribly hard to add and would IMO be quite nifty.) If this is done, then you still have to do: __this_thread_rseq->rseq_cs = _rseq_cs_here; but that's a *store* to the rseq cacheline. It'll go in the store buffer and, in the fast case, nothing ever depends on it. Hence it's unlikely to stall. Of course, there's still a load from the rseq_lock, but that shares a cacheline with the data that it protects, so it's basically free. IOW, I think that adding event_counter encourages user code that has a weaker progress guarantee, is slower in a very highly optimized implementation, and allows users to think less about they're doing to their own detriment, for the gain of saving a couple of instructions. I could be wrong, but I think it would be nice to see evidence that I'm wrong before event_counter becomes enshrined in the ABI. Also, I suspect that neat tools like rr will have a much easier time dealing with rseq if event_counter isn't involved. > > RDPID is interesting, but it seems like given the current ABI (which > requires setting the current restartable range) you'd still need some sort > of TLS value. Have any benchmarks of RDPID perf vs TLS? That's the wrong comparison. For a good
Re: rseq event_counter field
Answering both emails here. Also, welcome Kyle. Kyle, how badly does rseq's proposed event_counter break rr? For that matter, how badly does rseq without an event_counter break rr? (Linus, if you care, I'm proposing that rseq is probably fine for 4.15, but that it should be merged without the event_counter. If there is a bona fide use for event_counter along with a benchmark showing that it makes a meaningful difference, then event_counter can be easily added later.) On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer wrote: > My understanding is that one of the motivations for the event counter was to > make it possible to express things like "increment a per-cpu value" in C. > Without this abstraction it wasn't possible to ensure that the strict > requirements of the rseq ABI were met (contiguous restartable block, single > instruction for the final write). If you write your entire restartable > sequence in assembly there's no need for the event counter. Comparing all > input values as an alternative way of having non-assembly sequences seems > like it might have a risk of ABA style race conditions. > My opinion after looking at a few examples is that event_counter sounds useful but that it's really more of a crutch that enables mediocre userspace code. With event_counter, you can do roughly this: struct rseq_state rseq = begin(); int cpu = rseq_getcpu(); Calculate some stuff. Be very very careful to to chase pointers, because it's easy to do in a very very fragile way as Mathieu's example below does. commit(, some pointer you found, some value to write, some double-checks to check); With a real database, this style of programming works well. You probably have a progress guarantee, and the database gives you the semantics you want. With rseq, you don't get any of this. If the C code is too slow or does something (an accidental interaction with userfaultfd, perhaps?), then you lose forward progress. You also can be a bit slow in the C code, in which case you'll get more conflicts detected than actually exist. Without event_counter, you can use a different abstraction. In your percpu data, you have something like: struct rseq_lock lock; struct freelist_entry *head; Then you can do: int cpu = rseq_getcpu(); struct my_data *data = ... cpu ...; struct rseq_state rseq = rseq_acquire(data->lock); Calculate some stuff, being careful not to touch anything that isn't protected by the rseq_lock you're using. commit(, cpu, some pointer you found, some value to write, some double-checks to check); This variant is indeed a tiny bit slower *if you measure time by number if instructions* -- there are some very clever optimizations that save some instructions in the event_counter variant. But, in the cases where you actually have C code in here, I suspect that the difference won't matter in real code. And the counter-less variant can have arbitrarily slow C code or even syscalls without being spuriously aborted until the very short asm commit sequence starts. BUT, I think that the event_counter-free version is likely to be faster in real code if the library is done well. The version with event_counter has to load the event_counter from the rseq cacheline right at the beginning, and that value is used, so unless the CPU is really quite aggressive avoid speculating through a branch that hasn't gotten its dependencies into cache yet, you're likely to stall a bit if the per-thread rseq cacheline isn't in L1. The variant without event_counter can avoid this problem as long as the architecture gives a cheap way to find the CPU number. On x86, this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on any CPU as long as the kernel supports per-cpu FSBASE. (That's a feature that isn't terribly hard to add and would IMO be quite nifty.) If this is done, then you still have to do: __this_thread_rseq->rseq_cs = _rseq_cs_here; but that's a *store* to the rseq cacheline. It'll go in the store buffer and, in the fast case, nothing ever depends on it. Hence it's unlikely to stall. Of course, there's still a load from the rseq_lock, but that shares a cacheline with the data that it protects, so it's basically free. IOW, I think that adding event_counter encourages user code that has a weaker progress guarantee, is slower in a very highly optimized implementation, and allows users to think less about they're doing to their own detriment, for the gain of saving a couple of instructions. I could be wrong, but I think it would be nice to see evidence that I'm wrong before event_counter becomes enshrined in the ABI. Also, I suspect that neat tools like rr will have a much easier time dealing with rseq if event_counter isn't involved. > > RDPID is interesting, but it seems like given the current ABI (which > requires setting the current restartable range) you'd still need some sort > of TLS value. Have any benchmarks of RDPID perf vs TLS? That's the wrong comparison. For a good userspace