Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-04 Thread Ben Maurer
Hey -

I think the ideas Daniel brings up here are interesting -- specifically the 
notion that a thread could set a "pre-sleep wish" to signal that it's sleeping. 
As this conversation shows, I think there's a fair bit of depth to that. For 
example, FUTEX_LOCK is an alternative approach. Another idea might be using 
the "currently running cpu" field of the rseq area to tell whether the process 
in question was sleeping (assuming the kernel were modified to set this field 
to -1 every time a process was descheduled).

The idea discussed here seems orthogonal to the core thesis of rseq. I'm 
wondering if we can focus on getting rseq in, maybe with an eye toward making 
sure this use case can be handled long term. My sense is that this is possible. 
We could use the flags field in the per-thread rseq area, or perhaps extend the 
meaning of the structure rseq_cs points to so that it carries information 
about how to signal that the current process is sleeping. That seems to me like 
a natural way to add the functionality Daniel talks about, if it's desired in 
the future.

Daniel -- do you think there's anything we should do in the patch set today 
that would make it easier to implement your idea in the future, without 
expanding the scope of the patch? I.e., is there anything else we need to do 
to lay the groundwork for your idea?

I'd really love to see us get this patch in. There's a whole host of primitives 
it unlocks (more efficient RCU, better malloc implementations, fast 
reader-writer locks). I'm sure we'll have more ideas about APIs to provide once 
we've explored these use cases, but to me this patch is the MVP we need to ship 
in order to get that feedback. It's a solid framework that we can build upon, 
e.g. with the "opv" syscall or the idea in this thread, if user feedback shows 
those things are necessary.

-b

Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call

2017-11-14 Thread Ben Maurer

>>> int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
>>
>> Really dumb question -- and one I'm sorry to bring up at the last minute.
>> Should we consider making the syscall name something more generic, like
>> "register_tls_abi"?
>
> I proposed that approach back in 2016 ("tls abi" system call), and the
> feedback I received back then is that it was preferred to have a dedicated
> "rseq" system call than an "open ended" and generic "tls abi" system call.

Ultimately I'm fine either way. I do think that in the past few months of 
review it has become clear that creating this tls abi requires a fair bit of 
work. It'd be a shame to see a future attempt to use such an ABI made difficult 
by forcing the author to figure out the registration process yet again. I 
assume the maintainers of glibc would also like to avoid the need to register 
multiple ABIs.

-b

Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call

2017-11-14 Thread Ben Maurer
(apologies for the duplicate email, the previous one bounced as it was 
accidentally using HTML formatting)

If I understand correctly, this is run on every context switch, so we probably 
want to make it really fast.

> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
> +{
> +	bool need_restart = false;
> +	uint32_t flags;
> +
> +	/* Get thread flags. */
> +	if (__get_user(flags, &t->rseq->flags))
> +		return -EFAULT;
> +
> +	/* Take into account critical section flags. */
> +	flags |= cs_flags;
> +
> +	/*
> +	 * Restart on signal can only be inhibited when restart on
> +	 * preempt and restart on migrate are inhibited too. Otherwise,
> +	 * a preempted signal handler could fail to restart the prior
> +	 * execution context on sigreturn.
> +	 */
> +	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +			return -EINVAL;
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +			return -EINVAL;
> +	}

How does this error even get to userspace? Is it worth doing this check on 
every execution?


> +	if (t->rseq_migrate
> +	    && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +		need_restart = true;
> +	else if (t->rseq_preempt
> +	    && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +		need_restart = true;
> +	else if (t->rseq_signal
> +	    && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
> +		need_restart = true;

This could potentially be sped up by having the rseq_* fields in t use a single 
bitmask with the same bit offsets as the RSEQ_CS_FLAG_NO_* flags, then using 
bit operations to check the appropriate overlap.

Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call

2017-11-14 Thread Ben Maurer
> int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);

Really dumb question -- and one I'm sorry to bring up at the last minute. 
Should we consider making the syscall name something more generic, like 
"register_tls_abi"? I'm assuming that if we ever want a per-thread 
userspace/kernel ABI we'll want to use this area, given the difficulty of 
getting adoption of registration, the need to involve glibc, etc. It seems like 
there could be future use cases for this TLS area that have nothing to do with 
rseq.


Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

2017-10-23 Thread Ben Maurer
> if (!((long)ip - (long)start_ip <= (long)post_commit_offset))
>   return 1;

> This introduces an issue here: if "ip" is lower than "start_ip", we
> can incorrectly think we are in a critical section, when we are in
> fact not.

This shouldn't be an issue if we used unsigned numbers. E.g. if start_ip is X 
and post_commit_offset is L, then with unsigned arithmetic (ip - X <= L) is 
safe: if ip is less than X, ip - X wraps around to a very large unsigned 
value, so the comparison fails.

> or to the kernel to set it back to NULL if it finds out that it is
> preempting/delivering a signal over an instruction pointer outside
> of the current rseq_cs start_ip/post_commit_ip range (lazy clear).

I see, lazy clear makes sense. Still, if user code enters some rseq section 
during most execution periods (likely if rseq is used for something like 
malloc), this code will have to run on every context switch.

> Moreover, this modification would add a subtraction on the common case
> (ip - start_ip), and makes the ABI slightly uglier.

We could benchmark it, but the subtraction should be similar in cost to the 
extra comparison, and reducing the number of branches seems like it would help 
as well. FWIW, GCC attempts to translate this kind of sequence into a subtract 
and compare: https://godbolt.org/g/5DGLvo.

I agree the ABI is uglier, but since we're mucking with every context switch I 
thought I'd point it out.

> If I understand well, you are proposing to speed up .so load time by
> means of removing relocations of pointers within rseq_cs, done by
> making those relative to the rseq_cs address.

Yeah, I think this may be overkill as an optimization.


Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

2017-10-18 Thread Ben Maurer
> The layout of struct rseq_cs is as follows:

> start_ip
>     Instruction pointer address of the first instruction of the
>     sequence of consecutive assembly instructions.

> post_commit_ip
>     Instruction pointer address after the last instruction of
>     the sequence of consecutive assembly instructions.

> abort_ip
>     Instruction pointer address where to move the execution
>     flow in case of abort of the sequence of consecutive
>     assembly instructions.

Really minor performance thought here.

1) In the kernel at context switch time you'd need code like:

if (ip >= start_ip && ip <= post_commit_ip)

This branch would be hard to predict, because most instruction pointers would 
be either before or after the range. If post_commit_ip were expressed relative 
to start_ip, you could do this:

if (ip - start_ip <= post_commit_offset)

which is a single branch that would be more predictable.

2) In a shared library, an rseq_cs structure would have to be relocated at 
runtime, because the final address of the library isn't known at compilation 
time. I'm not sure this is important enough to address, but it could be solved 
by making the pointers relative to the address of rseq_cs. This would make for 
an uglier API, though.


Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

2017-10-17 Thread Ben Maurer
> I have a use-case for keeping the reference counting in place though. It's
> use of rseq in signal handlers.

Would this be solved by saying that the rseq API returns an error if you 
register and there's already a block registered? In that case the signal 
handler would register the rseq ABI area just as the non-signal code was trying 
to do. The non-signal code would see this error code, realize that its job had 
been done for it, and go on its way.

It would be unsafe for signal handler code to *unregister* the area, but I 
don't think that's necessary.

Basically we'd support a refcount of either 0 or 1, but nothing else.

If a signal handler registers the ABI area, how will it ensure the ABI is 
cleaned up at thread exit? I can't imagine pthread_key is signal safe.

Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

2017-10-17 Thread Ben Maurer
Hey,

> So far the restrictions I see for libraries using this symbol are:
> - They should never be unloaded,
> - They should never be loaded with dlopen RTLD_LOCAL flag.

We talked a bit about this off-list but I wanted to state publicly that I think 
this model works well for our use case. Specifically,

(1) It reduces complexity by focusing on the common case -- long term we expect 
glibc to manage the process of using this feature and registering/deregistering 
threads for rseq. Unloading isn't a challenge in these situations, so why add 
the complexity for it?

(2) This still allows early adopters to use rseq before there is glibc 
support. I believe the vast majority of real-world applications meet the two 
criteria you've listed. If not, they can create a thin shared library whose 
sole purpose is to provide the weak symbol and which never gets unloaded.

(3) This allows applications to provide __rseq_abi themselves so that they can 
ensure it uses the initial-exec TLS model and optimize in-application assembly 
code for it. This is a good optimization for server applications, which tend 
to link statically.

If others agree with this, would it make sense to remove the concept of 
reference counting from the system call that registers the per-thread area? 
It seems like that would remove complexity.

-b

Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

2017-10-13 Thread Ben Maurer
Hey,

I'm really excited to hear that you're open to this patch set, and I totally 
understand the desire for some more numbers. I have a few thoughts and 
questions -- hopefully ones that could help me better understand where you'd 
like to see more data (potentially from myself and other Facebook folks).

> A "increment percpu value" simply isn't relevant.

While I understand it seems trivial, my experience has been that this type of 
operation can actually be important in many server workloads. In applications 
with thousands of threads, keeping a counter like this can pose a challenge. 
One can use a per-thread variable, but the size overhead can be very large 
(8 bytes per counter per thread adds up quickly), and atomic instructions can 
cause contention quickly. Server programs tend to have many statistical 
counters; being able to implement them efficiently without size bloat is a 
real-world win.

This type of per-cpu counter can also very quickly be used to implement other 
abstractions in common use -- e.g. an asymmetric reader-writer lock or a 
reference-counted object that is changed infrequently. While thread-local 
storage can also be used in these cases, it carries a substantial size 
overhead and can require cooperation between the application and the library 
to manage thread lifecycle.

At least from what I've seen of our usage of these types of abstractions within 
Facebook, if rseq met these use cases and did absolutely nothing else it would 
still be a feature that our applications would benefit from. Hopefully we can 
find evidence that it can do even more than this, but I think that this 
"trivial" use case is actually addressing a real world problem.

> When I asked last time, people pointed me to potential uses, including
> malloc libraries that could get per-thread performance with just
> per-cpu (not per thread) data structure overhead. I see that you once
> more point to the slides from 2013 that again talks about it.
>
> But people didn't post code, people didn't post numbers, and people
> didn't point to actual real uses, just "this could happen".

At Facebook we did some work to experiment with rseq and jemalloc; Qi and 
David (cc'd) may be able to provide more context on the current state.

> Because without real-world uses, it's not obvious that there won't be
> somebody who goes "oh, this isn't quite enough for us, the semantics
> are subtly incompatible with our real-world use case".

Is your concern mainly this question (i.e., is this patchset a good way to 
bring per-cpu algorithms to userspace)? I'm hoping that, given the variety of 
ways per-cpu data structures are used in the kernel, the concerns around this 
patch set are mainly about which approach we should take rather than whether 
per-cpu algorithms are a good idea at all. If that is your main concern, 
perhaps our focus should be on demonstrating that a number of useful per-cpu 
algorithms can be implemented using restartable sequences.

Ultimately I'm worried there's a chicken-and-egg problem here. It's hard to 
get people to commit to investing in rseq without a clear path to the patchset 
seeing the light of day. It's also easy to understand why you'd be reluctant 
to merge such a substantial and unfamiliar API without extensive data. If 
we're still not able to get compelling data, I'm wondering if there are other 
approaches that could get us unstuck, e.g.:

(1) Could we merge enough of this patchset to be useful on its own (e.g. 
things like the faster getcpu() operation, which seems like a good improvement 
over the current approach)? If we make the remaining patches small enough, it 
may be easier for sophisticated users to apply them, maintain them, and 
provide real-world operational experience with this abstraction.

(2) Could we implement restartable sequences in the kernel but only allow the 
vdso to make use of them? We could have the vdso export a number of high-level 
operations (like the ones suggested in Paul Turner's original presentation -- 
per-cpu spin lock, per-cpu atomic increment/decrement, per-cpu list push/pop). 
This would allow us to get real-world data about how these primitives are used 
without committing to a complex ABI -- only committing to support the specific 
operations. If the whole idea flops we could eventually create a slow/naive 
implementation of the vdso functions and kill restartable sequences entirely.

-b

Re: [RFC PATCH v8 1/9] Restartable sequences system call

2016-08-25 Thread Ben Maurer
On 8/25/16, 10:08 AM, "Mathieu Desnoyers"  
wrote:
> The most appealing application we have so far is Dave Watson's Facebook
>services using jemalloc as a memory allocator. It would be nice if he
>could re-run those benchmarks with my rseq implementation. The trade-offs
>here are about speed and memory usage:
 
One piece of context I’d like to provide about our investigation into using 
rseq for malloc: I think it’s really important to keep in mind that we’ve 
optimized the software we write to survive well in a world where jemalloc has 
a high per-thread memory cost and is unable to have allocation caches as large 
as we would like. We’re not going to have real-world benchmarks that show a 
magical improvement with rseq, because over time we’ve baked the constraints 
of our environment into the design of our programs and optimized for the 
current set of APIs the kernel provides. I do think rseq provides a benefit 
even for applications optimized for today’s malloc implementations. But the 
real benefit is the new types of application design that rseq enables, and the 
ability of rseq to provide predictable performance for low-level APIs with 
much less investment from users. I’ll illustrate the costs that rseq would let 
us avoid with two examples of design choices we’ve made:

1) Because jemalloc uses a per-thread cache, threads that are sleeping have a 
high memory cost. For example, if you have a thread pool with 100 threads but 
only 10 are used most of the time, the other 90 threads will still have a 
dormant cache consuming memory. To combat this we have an abstraction called 
MemoryIdler 
(https://github.com/facebook/folly/blob/master/folly/detail/MemoryIdler.h), 
which is essentially a wrapper around futex that signals jemalloc to release 
its caches when the thread is idle. From what I can tell this practice isn’t 
widely adopted even though it can save a substantial amount of memory. rseq 
makes this a moot point, since caches can be per-cpu and the memory allocator 
does not need to worry about an idle thread hogging the cache.
2) The per-thread nature of malloc implementations has generally led people to 
avoid thread-per-request designs. Things like MemoryIdler can help if a thread 
is going to be idle for seconds before it is used again, but if your thread 
makes a 100 ms RPC to another server, clearing the cache is far too expensive 
to justify. You still have a bunch of memory sitting around unused for 100 ms; 
multiply that by many concurrent requests and you are consuming a lot of 
memory. This has forced people to handle multiple requests in a single thread, 
which leads to problems of its own, like a contended lock in one request 
impacting many other requests on the same thread.

rseq opens up a whole world of algorithms to userspace: algorithms that are 
O(num CPUs) and where one can have an extremely fast fast path at the cost of 
a slower slow path. Many of these algorithms are in use in the kernel today: 
per-cpu allocators, RCU, lightweight reader-writer locks, etc. Even in cases 
where these APIs can be implemented today, an rseq implementation is often 
superior in terms of predictability and usability (e.g., per-thread counters 
consume more memory and are more expensive to read than per-cpu counters).

Isn’t the large number of rseq-like algorithms in use in the kernel a pretty 
substantial sign that there would be demand for similar algorithms among 
user-space systems programmers?

-b



RE: [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)

2015-07-17 Thread Ben Maurer
Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register a userspace memory
> area where to store the current CPU number. Scheduler migration sets the

I really like that this approach makes it easier to add a per-thread 
interaction between userspace and the kernel in the future.

>+   if (!tlap || t->thread_local_abi_len <
>+   offsetof(struct thread_local_abi, cpu)
>+   + sizeof(tlap->cpu))

Could you save a branch here by enforcing that thread_local_abi_len == 0 
whenever thread_local_abi == NULL?

-b
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

