Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-26 Thread Gábor Lehel
On Sun, Jan 26, 2014 at 1:26 PM, Matthieu Monrocq matthieu.monr...@gmail.com wrote:




 On Sun, Jan 26, 2014 at 3:31 AM, Brian Anderson bander...@mozilla.com wrote:

 On 01/25/2014 08:58 AM, Bill Myers wrote:

 Stack management for green tasks has been based in the past first on
 segmented stacks and then on standard large stacks.

 However, I just realized that there is a third alternative which might
 well be better than both of those.

 The idea is very simple: a green task would run on a large stack like
 now, but when it is descheduled, a memory buffer of the size of its used
 stack space is allocated, and its stack data is copied there; when it is
 rescheduled, the stack data is copied back to the original address.


 That's a really clever idea. There are, though, a number of situations
 where we unsafely read from or write into other tasks' stacks that would
 need to be considered carefully.


 Would it be possible to statically eliminate this risk by preventing such
 affected tasks from being promoted to greenlets? Or are there today too many
 ways to read into a greenlet's stack?


We *already* statically prevent tasks from reading/writing each other's
stacks; the issue here is code which uses `unsafe { }` and does it anyway,
based on particular assumptions about how stacks are implemented which
would no longer be valid. The solution would be to (carefully) fix this
unsafe code to no longer rely on those assumptions.



 -- Matthieu











Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-26 Thread Gábor Lehel
On Sat, Jan 25, 2014 at 9:26 PM, Vadim vadi...@gmail.com wrote:

 Hi Bill,
 If I understand this right, what you are proposing is the approach of
 Python's greenlets library, which I already tried to pitch here:
 http://mail.mozilla.org/pipermail/rust-dev/2013-November/006544.html.


 I think that the biggest problem with this approach, in its raw form, is
 that greenlets are bound to a specific OS thread, which would cause all
 sorts of headache.  For example:
 - It would be difficult to balance greenlet tasks between several OS
 threads,
 - If async I/O requests for two greenlets, bound to the same OS thread,
 were to complete simultaneously, you would not be able to resume them in
 parallel,
 - If a thread were to become blocked for a significant amount of time, all
 the greenlets bound to it become un-resumable,
 and so on...

 Greenlets would be much more palatable if their stacks were
 position-independent.  This might be possible to achieve with a suitable
 LLVM transform
 (http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-January/069662.html),
 though I am not yet sure how portable this would be.


Totally not an expert in this area, but I'm wondering two things:

 - Once you hoist all variables whose address is taken out into this
secondary stack frame, if you have a pointer on the stack to an object on
the stack which *also*, in turn, has its address taken, then I believe you
would end up with a pointer in the secondary stack frame to an object in
the secondary stack frame, i.e. the secondary stack frame itself would not
be position independent. Does the secondary stack frame always stay in
place, so we don't really care about this?

 - Instead of heap-allocating a secondary stack frame, could another
potential solution be to use similar infrastructure as for DWARF-based
unwinding and/or precise GC tracing of the stack, in other words to emit a
side table of the positions of pointers into the stack in each stack frame,
and then, when relocating the stack, to consult this side table and manually
fix up all of the affected pointers? In this case, aside from the presence
of the side tables, the cost is borne only when actually relocating the
stack.
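
For concreteness, here is a rough sketch (purely hypothetical names -- nothing
like this exists in the compiler or runtime today) of how a relocation pass
might consult such a side table to patch stack-internal pointers:

    /// Offsets from the stack base of slots that (per the compiler-emitted
    /// side tables) hold pointers into the stack itself.
    type PointerSideTable = Vec<usize>;

    /// Relocate a saved stack image from `old_base` to `new_base`, patching
    /// every slot listed in the side table. Assumes the slots are
    /// pointer-aligned.
    unsafe fn relocate_stack(stack: &mut [u8], old_base: usize,
                             new_base: usize, table: &PointerSideTable) {
        let len = stack.len();
        for &off in table {
            let slot = stack.as_mut_ptr().add(off) as *mut usize;
            let value = *slot;
            // Only patch values that actually point into the old stack range.
            if value >= old_base && value < old_base + len {
                *slot = value.wrapping_add(new_base.wrapping_sub(old_base));
            }
        }
    }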





 cheers,
 Vadim



 On Sat, Jan 25, 2014 at 8:58 AM, Bill Myers bill_my...@outlook.com wrote:

 Stack management for green tasks has been based in the past first on
 segmented stacks and then on standard large stacks.

 However, I just realized that there is a third alternative which might
 well be better than both of those.

 The idea is very simple: a green task would run on a large stack like
 now, but when it is descheduled, a memory buffer of the size of its used
 stack space is allocated, and its stack data is copied there; when it is
 rescheduled, the stack data is copied back to the original address.

 The copying can be performed either by memcpy (perhaps with a specialized
 memcpy that assumes alignment and always uses SIMD instructions to copy with
 no checks), or perhaps by using a compression algorithm designed for
 maximum speed, such as Snappy or maybe a new ad-hoc design.

 This allows having the best possible memory efficiency and thus the
 maximum possible number of tasks, even better than segmented stacks due to
 precise sizing and optional compression, while not adding any overhead to
 code that does not block, either in terms of code generation complexity or
 runtime cost.

 There is of course an additional runtime cost when blocking, but blocking
 already often involves both system calls and context switching, and
 workloads with lots of blocking green tasks are probably I/O bound anyway;
 furthermore, stack data is probably in the CPU cache, so there shouldn't be
 much penalty.

 The only significant drawback is that two green tasks running from the
 same large stack (note that stacks cannot be easily switched, as that
 would invalidate borrowed pointers on the stack to the stack) cannot
 ever run concurrently; however, by making the number of large stacks a
 large enough multiple of the number of CPU cores, the probability of
 reduced parallelism due to this issue can be made as small as desired.

 In general, one would use an adaptive approach, where this kind of
 copying would start to kick in only once the number of green tasks becomes
 large; when not copying, one would just optionally use
 madvise(MADV_DONTNEED) to trim unused whole pages of the stack.

 Also, if the data to be copied is large (several OS pages), one may use
 mremap() or similar facilities to move the address space mappings instead
 of the data itself.

 Some unsafe code that assumes that tasks can access each other's stacks
 will break (probably only the new Mutex implementation), but that can be
 fixed by putting the data in the task structure instead of the stack.

 Overall, this should allow writing something like a parallel network
 server or web crawler in Rust in a natural blocking style, while being
 fully confident in its scalability, which is something that is not really
 possible at the moment.

Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-26 Thread Gábor Lehel
On Sun, Jan 26, 2014 at 2:02 PM, Gábor Lehel glaebho...@gmail.com wrote:




 On Sat, Jan 25, 2014 at 9:26 PM, Vadim vadi...@gmail.com wrote:

 Hi Bill,
 If I understand this right, what you are proposing is the approach of
 Python's greenlets library, which I already tried to pitch here:
 http://mail.mozilla.org/pipermail/rust-dev/2013-November/006544.html.


 I think that the biggest problem with this approach, in its raw form, is
 that greenlets are bound to a specific OS thread, which would cause all
 sorts of headache.  For example:
 - It would be difficult to balance greenlet tasks between several OS
 threads,
 - If async I/O requests for two greenlets, bound to the same OS thread,
 were to complete simultaneously, you would not be able to resume them in
 parallel,
 - If a thread were to become blocked for a significant amount of time,
 all the greenlets bound to it become un-resumable,
 and so on...

 Greenlets would be much more palatable if their stacks were
 position-independent.  This might be possible to achieve with a suitable
 LLVM transform
 (http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-January/069662.html),
 though I am not yet sure how portable this would be.


 Totally not an expert in this area, but I'm wondering two things:

  - Once you hoist all variables whose address is taken out into this
 secondary stack frame, if you have a pointer on the stack to an object on
 the stack which *also*, in turn, has its address taken, then I believe you
 would end up with a pointer in the secondary stack frame to an object in
 the secondary stack frame, i.e. the secondary stack frame itself would not
 be position independent. Does the secondary stack frame always stay in
 place, so we don't really care about this?


(Thinking about this further, this was a silly question, because the
primary stack frame itself would have pointers into the secondary one, so
_obviously_ the latter cannot move.)


[rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread Bill Myers
Stack management for green tasks has been based in the past first on segmented 
stacks and then on standard large stacks.

However, I just realized that there is a third alternative which might well be 
better than both of those.

The idea is very simple: a green task would run on a large stack like now, but 
when it is descheduled, a memory buffer of the size of its used stack space is 
allocated, and its stack data is copied there; when it is rescheduled, the 
stack data is copied back to the original address.

The copying can be performed either by memcpy (perhaps with a specialized 
memcpy that assumes alignment and always uses SIMD instructions to copy with no 
checks), or perhaps by using a compression algorithm designed for maximum 
speed, such as Snappy or maybe a new ad-hoc design.
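
For illustration, a minimal sketch of the plain-memcpy variant (hypothetical
names, not any actual scheduler API) could look like:

    struct SavedStack {
        base: *mut u8, // address the task's stack occupies while running
        data: Vec<u8>, // heap buffer holding the used portion while descheduled
    }

    /// On deschedule: copy the used part of the stack into a right-sized buffer.
    unsafe fn save_stack(base: *mut u8, used_bytes: usize) -> SavedStack {
        let mut data = vec![0u8; used_bytes];
        std::ptr::copy_nonoverlapping(base, data.as_mut_ptr(), used_bytes);
        SavedStack { base, data }
    }

    /// On reschedule: copy the data back to the *same* address, so borrowed
    /// pointers into the stack remain valid.
    unsafe fn restore_stack(saved: &SavedStack) {
        std::ptr::copy_nonoverlapping(saved.data.as_ptr(), saved.base,
                                      saved.data.len());
    }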

This allows having the best possible memory efficiency and thus the maximum 
possible number of tasks, even better than segmented stacks due to precise 
sizing and optional compression, while not adding any overhead to code that 
does not block, either in terms of code generation complexity or runtime cost.

There is of course an additional runtime cost when blocking, but blocking 
already often involves both system calls and context switching, and workloads 
with lots of blocking green tasks are probably I/O bound anyway; furthermore, 
stack data is probably in the CPU cache, so there shouldn't be much penalty.

The only significant drawback is that two green tasks running from the 
same large stack (note that stacks cannot be easily switched, as that 
would invalidate borrowed pointers on the stack to the stack) cannot 
ever run concurrently; however, by making the number of large stacks a 
large enough multiple of the number of CPU cores, the probability of 
reduced parallelism due to this issue can be made as small as desired.

In general, one would use an adaptive approach, where this kind of copying 
would start to kick in only once the number of green tasks becomes large; when 
not copying, one would just optionally use madvise(MADV_DONTNEED) to trim 
unused whole pages of the stack.
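
As a sketch, the trimming step might look like this (using the libc crate,
assuming a downward-growing stack and 4 KiB pages; purely illustrative):

    /// Return the whole pages below the saved stack pointer to the OS.
    /// `lowest` is the bottom of the stack mapping, `sp` the task's stack pointer.
    unsafe fn trim_unused_pages(lowest: *mut u8, sp: *mut u8) {
        const PAGE: usize = 4096;
        // Round the stack pointer down to a page boundary; the pages strictly
        // below it hold no live data on a downward-growing stack.
        let unused_top = (sp as usize) & !(PAGE - 1);
        let len = unused_top.saturating_sub(lowest as usize);
        if len >= PAGE {
            libc::madvise(lowest as *mut libc::c_void, len, libc::MADV_DONTNEED);
        }
    }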

Also, if the data to be copied is large (several OS pages), one may use 
mremap() or similar facilities to move the address space mappings instead of 
the data itself.

Some unsafe code that assumes that tasks can access each other's stacks will 
break (probably only the new Mutex implementation), but that can be fixed by 
putting the data in the task structure instead of the stack.

Overall, this should allow writing something like a parallel network server or 
web crawler in Rust in a natural blocking style, while being fully confident in 
its scalability, which is something that is not really possible at the moment.

Once this is in place, the best practices for Rust programs would become:
1. Use native tasks for tasks whose number is bounded at compile time and small
2. Use green tasks for tasks whose number is unbounded or large

One could even automate this to some extent by spawning by default a native 
task the first C times a given proc() code address is used to start a task 
(where C = number of CPU cores), and green tasks afterwards.
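
A rough sketch of that heuristic (all names hypothetical; spawn_native and
spawn_green stand in for whatever the runtime actually provides):

    use std::collections::HashMap;
    use std::sync::Mutex;

    /// Spawn counts keyed by the code address of the task body.
    static COUNTS: Mutex<Option<HashMap<usize, usize>>> = Mutex::new(None);

    fn adaptive_spawn(body: fn(), num_cpus: usize) {
        let key = body as usize;
        let mut guard = COUNTS.lock().unwrap();
        let n = guard.get_or_insert_with(HashMap::new).entry(key).or_insert(0);
        *n += 1;
        if *n <= num_cpus {
            spawn_native(body); // first C spawns of this code address: 1:1 task
        } else {
            spawn_green(body); // afterwards: M:N green task
        }
    }

    // Placeholders for the two runtimes' spawn entry points.
    fn spawn_native(body: fn()) { let _ = std::thread::spawn(body); }
    fn spawn_green(body: fn()) { /* hand off to the green scheduler */ body(); }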

What do you think?


Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread Vadim
Hi Bill,
If I understand this right, what you are proposing is the approach of
Python's greenlets library, which I already tried to pitch here:
http://mail.mozilla.org/pipermail/rust-dev/2013-November/006544.html.


I think that the biggest problem with this approach, in its raw form, is
that greenlets are bound to a specific OS thread, which would cause all
sorts of headache.  For example:
- It would be difficult to balance greenlet tasks between several OS
threads,
- If async I/O requests for two greenlets, bound to the same OS thread,
were to complete simultaneously, you would not be able to resume them in
parallel,
- If a thread were to become blocked for a significant amount of time, all
the greenlets bound to it become un-resumable,
and so on...

Greenlets would be much more palatable if their stacks were
position-independent.  This might be possible to achieve with a suitable
LLVM transform
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-January/069662.html),
though I am not yet sure how portable this would be.


cheers,
Vadim



On Sat, Jan 25, 2014 at 8:58 AM, Bill Myers bill_my...@outlook.com wrote:

 Stack management for green tasks has been based in the past first on
 segmented stacks and then on standard large stacks.

 However, I just realized that there is a third alternative which might
 well be better than both of those.

 The idea is very simple: a green task would run on a large stack like now,
 but when it is descheduled, a memory buffer of the size of its used stack
 space is allocated, and its stack data is copied there; when it is
 rescheduled, the stack data is copied back to the original address.

 The copying can be performed either by memcpy (perhaps with a specialized
 memcpy that assumes alignment and always uses SIMD instructions to copy with
 no checks), or perhaps by using a compression algorithm designed for
 maximum speed, such as Snappy or maybe a new ad-hoc design.

 This allows having the best possible memory efficiency and thus the
 maximum possible number of tasks, even better than segmented stacks due to
 precise sizing and optional compression, while not adding any overhead to
 code that does not block, either in terms of code generation complexity or
 runtime cost.

 There is of course an additional runtime cost when blocking, but blocking
 already often involves both system calls and context switching, and
 workloads with lots of blocking green tasks are probably I/O bound anyway;
 furthermore, stack data is probably in the CPU cache, so there shouldn't be
 much penalty.

 The only significant drawback is that two green tasks running from the
 same large stack (note that stacks cannot be easily switched, as that
 would invalidate borrowed pointers on the stack to the stack) cannot
 ever run concurrently; however, by making the number of large stacks a
 large enough multiple of the number of CPU cores, the probability of
 reduced parallelism due to this issue can be made as small as desired.

 In general, one would use an adaptive approach, where this kind of copying
 would start to kick in only once the number of green tasks becomes large;
 when not copying, one would just optionally use madvise(MADV_DONTNEED) to
 trim unused whole pages of the stack.

 Also, if the data to be copied is large (several OS pages), one may use
 mremap() or similar facilities to move the address space mappings instead
 of the data itself.

 Some unsafe code that assumes that tasks can access each other's stacks
 will break (probably only the new Mutex implementation), but that can be
 fixed by putting the data in the task structure instead of the stack.

 Overall, this should allow writing something like a parallel network
 server or web crawler in Rust in a natural blocking style, while being
 fully confident in its scalability, which is something that is not really
 possible at the moment.

 Once this is in place, the best practices for Rust programs would become:
 1. Use native tasks for tasks whose number is bounded at compile time and
 small
 2. Use green tasks for tasks whose number is unbounded or large

 One could even automate this to some extent by spawning by default a
 native task the first C times a given proc() code address is used to start
 a task (where C = number of CPU cores), and green tasks afterwards.

 What do you think?



Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread Patrick Walton

On 1/25/14 8:58 AM, Bill Myers wrote:

Stack management for green tasks has been based in the past first on
segmented stacks and then on standard large stacks.


I assume this is incompatible with work stealing and task migration 
between threads?


Patrick



Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread Bill Myers
 I assume this is incompatible with work stealing and task migration
 between threads?

It is compatible, assuming that the number of large stacks is sufficiently 
larger than the number of threads.

Basically, each green task can only run on a specific large stack, but as long
as you aren't so unlucky that every runnable green task you could steal is bound
to a large stack that already has a green task running on it, you can steal one.

The probability of that obviously decreases the more large stacks you use;
large stacks consume only address space (if you wipe them with
madvise(MADV_DONTNEED) or similar when unused), so one can reasonably have 256
8 MB large stacks on 32-bit, for instance.

So for instance, if you had an HT 4-core machine with 8 system threads running
green tasks and 256 large stacks, the probability that a given non-blocked green
task is not executable (because its large stack is in use by another thread) is
7/256; and if K tasks are not blocked, the probability of having none of them
executable becomes roughly (7/256)^K, assuming independence (i.e. exponentially
lower).

On 64-bit, the probability can be made arbitrarily low, although you consume 
more kernel memory to store the metadata associated with the memory mappings 
for the large stacks.

It would require rearchitecting the work-stealing code to work with a
two-level data structure: first steal a large stack with at least one
non-blocked task, and then run one of its tasks, rather than directly stealing
the task.
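
A sketch of what that two-level structure might look like (all types made up
for illustration):

    use std::collections::VecDeque;
    use std::sync::Mutex;

    struct GreenTask { /* saved registers, copied-out stack data, ... */ }

    struct LargeStack {
        in_use: bool,                  // a worker is currently running on it
        runnable: VecDeque<GreenTask>, // non-blocked tasks bound to this stack
    }

    /// Level one: find an idle large stack with at least one runnable task;
    /// level two: take a task from it. Returns the stack's index and the task.
    fn steal(stacks: &[Mutex<LargeStack>]) -> Option<(usize, GreenTask)> {
        for (i, slot) in stacks.iter().enumerate() {
            let mut stack = slot.lock().unwrap();
            if !stack.in_use {
                if let Some(task) = stack.runnable.pop_front() {
                    stack.in_use = true; // reserve the stack for this worker
                    return Some((i, task));
                }
            }
        }
        None
    }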


Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread Bill Myers
Interesting: my proposal appears to be indeed a generalization of the greenlet 
approach.

Specifically, while the greenlet proposal seems to use only one large stack per
native thread, I'm suggesting using multiple large stacks that can be stolen
by other threads, which does mitigate the issues stemming from
position-dependent stacks (they are not eliminated, but they have a low
probability of occurring).

It's also indeed possible to fully eliminate those issues by autoboxing
everything whose address is taken, but that would have a potentially large
performance impact on non-blocking code, while merely copying the stack
imposes a performance penalty only on code that blocks.


Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread Bill Myers
The ratios of stacks per native thread and of tasks per stack can actually be
used to characterize all the systems discussed.

(stacks/thread, tasks/stack)
(1, 1) = current Rust native tasks
(1, N) = Python greenlets
(N, 1) = current Rust green tasks
(N, M) = proposal in my original mail


Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread comex
To me, issues such as large memcpys/system calls on context switches,
having tasks not be able to compute concurrently with random other
tasks, possibly having to autobox everything, etc. sound like they
would introduce complexity and performance loss equal to or greater than
just using split stacks.

If you really wanted to avoid wasting memory, couldn't you allow new
tasks to steal memory below the stack pointer of swapped-out tasks,
changing that task's stack bottom pointer in the process?


Re: [rust-dev] Idea: Memcpyed stacks for green tasks

2014-01-25 Thread Brian Anderson

On 01/25/2014 08:58 AM, Bill Myers wrote:

Stack management for green tasks has been based in the past first on segmented 
stacks and then on standard large stacks.

However, I just realized that there is a third alternative which might well be 
better than both of those.

The idea is very simple: a green task would run on a large stack like now, but 
when it is descheduled, a memory buffer of the size of its used stack space is 
allocated, and its stack data is copied there; when it is rescheduled, the 
stack data is copied back to the original address.


That's a really clever idea. There are, though, a number of situations
where we unsafely read from or write into other tasks' stacks that would
need to be considered carefully.

