Introducing taskpools, a small lightweight ARC/ORC threadpool

2023-11-29 Thread Zoom
I'm still wondering: since taskpools is presented as an ARC/ORC threadpool, 
maybe [that line in the 
Readme](https://github.com/status-im/nim-taskpools/blob/15e23ef1cf0860330dcc32f50fcce5f840031e28/README.md?plain=1#L131)
 listing _"Using no GC or --gc:arc, --gc:orc or --gc:boehm"_ as a non-goal 
should be fixed.


Introducing taskpools, a small lightweight ARC/ORC threadpool

2022-01-13 Thread kobi
That sounds very cool. Is it ready for use? What would the API be, and does it 
have documentation?


Introducing taskpools, a small lightweight ARC/ORC threadpool

2022-01-13 Thread mratsim
Adding a multithreaded event system to a threadpool can be done with the 
protocol described below.

In terms of vocabulary, an event system allows you to describe precise in/out 
dependencies between tasks; it goes by many names in the literature:

  * dataflow parallelism
  * graph parallelism
  * data-driven task parallelism
  * pipeline parallelism
  * stream parallelism



I've looked at papers and implementations of how it's done in other runtimes, 
but they all seemed pretty involved in terms of syntax, integration into the 
runtime, and overhead (like requiring hash tables of launchable tasks or an 
explicit dataflow graph):

  * LLVM/Intel OpenMP: API described on slide 21 of their presentation
  * Cpp Taskflow: API in the README, with explicit graph construction/edge 
connection
  * Swan (an extension of Intel Cilk): API described on slide 3 of the paper
  * Intel TBB: API in the documentation (have fun)
  * See also the related research literature


Let's start with the API, in Weave, but adaptable to taskpools or any 
threadpool:


import weave          # nimble install weave
from std/os import sleep

block: # Delayed computation

  proc echoA(eA: FlowEvent) =
    echo "Display A, sleep 1s, create parallel streams 1 and 2"
    sleep(1000)
    eA.trigger()

  proc echoB1(eB1: FlowEvent) =
    echo "Display B1, sleep 1s"
    sleep(1000)
    eB1.trigger()

  proc echoB2() =
    echo "Display B2, exit stream"

  proc echoC1(): bool =
    echo "Display C1, exit stream"
    return true

  proc main() =
    echo "Sanity check 3: Dataflow parallelism"
    init(Weave)
    let eA = newFlowEvent()
    let eB1 = newFlowEvent()
    let done = spawnOnEvent(eB1, echoC1())
    spawnOnEvent eA, echoB2()
    spawnOnEvent eA, echoB1(eB1)
    spawn echoA(eA)
    discard sync(done)
    exit(Weave)

  main()

block: # Delayed computation with multiple dependencies

  proc echoA(eA: FlowEvent) =
    echo "Display A, sleep 1s, create parallel streams 1 and 2"
    sleep(1000)
    eA.trigger()

  proc echoB1(eB1: FlowEvent) =
    echo "Display B1, sleep 1s"
    sleep(1000)
    eB1.trigger()

  proc echoB2(eB2: FlowEvent) =
    echo "Display B2, no sleep"
    eB2.trigger()

  proc echoC12() =
    echo "Display C12, exit stream"

  proc main() =
    echo "Sanity check 4: Dataflow parallelism with multiple dependencies"
    init(Weave)
    let eA = newFlowEvent()
    let eB1 = newFlowEvent()
    let eB2 = newFlowEvent()
    spawnOnEvents eB1, eB2, echoC12()
    spawnOnEvent eA, echoB2(eB2)
    spawnOnEvent eA, echoB1(eB1)
    spawn echoA(eA)
    exit(Weave)
    echo "Weave runtime exited"

  main()



Procs that are dependencies accept a FlowEvent parameter and call 
`event.trigger()` when appropriate, which schedules all procs that depend on 
that event. Multiple dependencies are possible. Function calls already build an 
implicit graph, so I don't see the advantage of explicit graph building as in 
Taskflow or Intel TBB. Testing and integration are also much easier since there 
is no graph data structure, just function calls.

Even in deep learning, the implicit neural-network graph builder (PyTorch) won 
by being significantly more productive and easier to use and debug than the 
explicit graph builder (TensorFlow).

The implementation is in the Weave repository. It is multithreading-runtime/
threadpool agnostic: the internals of Weave don't know anything about events or 
task graphs; only `spawnOnEvent` and `spawnOnEvents` do, and they only need a 
way to enqueue an immediately runnable task in the threadpool's work queue(s).

# Protocol

A flow event is an ownerless MPSC channel that holds tasks. The number of tasks 
in the channel is bounded by the number of dependent tasks.

When a worker triggers an event, it becomes the unique consumer of the MPSC 
channel. It flags the event as triggered and drains the channel of all tasks.

When a task depends on an event, the worker that received the dependent task 
checks the triggered flag:

  * Case 1: the event is already triggered; the worker schedules the task as a 
normal task.
  * Case 2: the event is not yet triggered; the worker sends the task into the 
flow event's MPSC channel.
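
In (simplified) code, the protocol looks roughly like this. This is only a 
sketch: a lock plus a seq stands in for the real lock-free MPSC channel, and 
the helper names (`Task`, `enqueue`) are illustrative, not the actual 
internals:

import std/locks

type
  Task = proc () {.gcsafe.}
  FlowEvent = ref object
    lock: Lock          # stands in for the MPSC channel's synchronization
    triggered: bool
    pending: seq[Task]  # dependent tasks parked on this event

proc newFlowEvent(): FlowEvent =
  new result
  initLock result.lock

proc enqueue(task: Task) =
  ## Placeholder: a real runtime pushes into a worker's work queue.
  task()

proc trigger(ev: FlowEvent) =
  ## The triggering worker becomes the unique consumer:
  ## flag the event, then drain all parked tasks.
  var toRun: seq[Task]
  withLock ev.lock:
    ev.triggered = true
    toRun = move ev.pending
  for t in toRun:
    enqueue t

proc spawnOnEvent(ev: FlowEvent, task: Task) =
  ## Case 1: already triggered -> schedule as a normal task.
  ## Case 2: not triggered -> park the task on the event.
  var runNow = false
  withLock ev.lock:
    if ev.triggered:
      runNow = true
    else:
      ev.pending.add task
  if runNow:
    enqueue task

when isMainModule:
  let ev = newFlowEvent()
  ev.spawnOnEvent(proc () = echo "runs after trigger")
  ev.trigger()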

Tasks with multiple dependencies are represented by

Introducing taskpools, a small lightweight ARC/ORC threadpool

2022-01-12 Thread elcritch
Brilliant! I'm literally just looking into how to do an event/thread-pool type 
of problem in embedded (Zephyr in this case). The sync/spawn types are pretty 
nice syntax and much easier to use than the system's built-in work queues. I 
don't want to run ORC, so this might work nicely. Thanks! 


Introducing taskpools, a small lightweight ARC/ORC threadpool

2022-01-12 Thread mratsim
@Araq, sure you can use whatever style. If readability is a problem you can 
also change fonts ;).

@elcritch, it works with plain old data (no need for ARC) and any GC that is 
not thread-local (i.e. memory can be reclaimed from any thread), so 
malloc/Boehm/ARC/ORC would work.


Introducing taskpools, a small lightweight ARC/ORC threadpool

2022-01-11 Thread elcritch
Looks great! Does the async work with ARC only or does it require ORC? 


Introducing taskpools, a small lightweight ARC/ORC threadpool

2022-01-11 Thread Araq
Awesome project, thanks for sharing!

Nitpicking: The readability of


let x = tp.spawn async_fib(n-1)
let y = async_fib(n-2)




is really bad though. Would following NEP1 improve it?


let x = tp.spawn asyncFib(n-1)
let y = asyncFib(n-2)




Yes, unsurprisingly. ;-)


Introducing taskpools, a small lightweight ARC/ORC threadpool

2022-01-10 Thread mratsim
Late September, I wrote Taskpools, a threadpool library with the following 
goals:

  * lightweight: the threadpool depends on few moving parts, unused memory is 
reclaimed.
  * energy-efficient: unused threads are parked to save on power consumption.
  * easily auditable and maintainable: Taskpools is used in blockchain; 
correctness and maintainability are of utmost importance, as reputation and 
money are on the line.



The library has been running 24/7 for the last couple of months on our Ethereum 
fleet, in test networks and some in production.

Example usage:



import ../taskpools/taskpools

block: # Async without result

  proc display_int(x: int) =
    stdout.write(x)
    stdout.write(" - SUCCESS\n")

  proc main() =
    echo "\nSanity check 1: Printing 123456 654321 in parallel"

    var tp = Taskpool.new(numThreads = 4)
    tp.spawn display_int(123456)
    tp.spawn display_int(654321)
    tp.shutdown()

  main()

block: # Async/Await

  var tp: Taskpool

  proc async_fib(n: int): int =
    if n < 2:
      return n

    let x = tp.spawn async_fib(n-1)
    let y = async_fib(n-2)

    result = sync(x) + y

  proc main2() =
    echo "\nSanity check 2: fib(20)"

    tp = Taskpool.new()
    let f = async_fib(20)
    tp.shutdown()

    doAssert f == 6765

  main2()



The primitives are:

  * `Taskpool.new()`
  * `tp.shutdown()`
  * `let fut = tp.spawn fn(a, b, c)`
  * `tp.sync(fut)` (~await)
  * `tp.syncAll()` blocks until all tasks are done, in particular tasks that 
don't return anything
  * `fut.isSpawned()`
  * `fut.isReady()`
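
A short sketch exercising these primitives end to end (assuming the package is 
importable as `taskpools`, and using the same `sync(fut)` call style as the 
fib example above):

import taskpools

proc square(x: int): int = x * x
proc ping() = echo "ping"

proc demo() =
  var tp = Taskpool.new(numThreads = 4)

  let fut = tp.spawn square(7)   # returns a Flowvar[int]
  doAssert fut.isSpawned()
  if not fut.isReady():
    echo "square(7) still in flight"
  doAssert sync(fut) == 49       # ~await: blocks until the result is ready

  tp.spawn ping()                # a task with no result, hence no Flowvar
  tp.syncAll()                   # wait for every task, including ping
  tp.shutdown()

demo()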



The main logic is really short, just under 500 lines of code. The threadpool is 
work-stealing based and lock-free, except for the logic to put threads to sleep 
(the lock-free alternative, "eventcount", is quite error-prone, would need 
formal verification to ensure no deadlocks are possible, and runs contrary to 
the "easily auditable/maintainable" goal).
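
For intuition, here is a deliberately simplified, lock-based sketch of the 
work-stealing policy (taskpools' actual deques are lock-free; `WorkQueue` and 
its procs are illustrative names only): the owner pops from the bottom of its 
own deque for cache locality, while thieves steal from the top.

import std/[locks, deques]

type
  Work = proc () {.gcsafe.}
  WorkQueue = object
    lock: Lock
    tasks: Deque[Work]

proc init(q: var WorkQueue) =
  initLock q.lock
  q.tasks = initDeque[Work]()

proc push(q: var WorkQueue, t: Work) =
  withLock q.lock:
    q.tasks.addLast t

proc popOwn(q: var WorkQueue): Work =
  ## The owning worker takes the most recently pushed task (LIFO, cache-hot).
  withLock q.lock:
    if q.tasks.len > 0:
      result = q.tasks.popLast()

proc trySteal(q: var WorkQueue): Work =
  ## A thief takes the oldest task from the opposite end (FIFO),
  ## minimizing contention with the owner.
  withLock q.lock:
    if q.tasks.len > 0:
      result = q.tasks.popFirst()

when isMainModule:
  var q: WorkQueue
  q.init()
  q.push(proc () = echo "task")
  let t = q.popOwn()
  if not t.isNil:
    t()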

Compared to Weave, here are the main differences:

  * Taskpool work-stealing is shared-memory based; Weave is message-passing 
based. The advantage is that no cooperation is needed and, if a thread is 
blocked (say on IO), other threads still make progress. The disadvantage is 
that advanced scheduling and load-balancing techniques, like stealing many 
tasks based on perf indicators or adaptive loop splitting, are impossible (?).
  * No parallelFor (data parallelism) and no deferred event 
scheduling/fine-grained dependencies (dataflow parallelism).
  * Less efficient load balancing for very short tasks (~10µs) or splittable 
loops. Tasks in the ~500µs range should reach the same performance.
  * More overhead for very short tasks: Weave has an adaptive memory pool based 
on state-of-the-art memory allocators (snmalloc and mimalloc), while each task 
generates many allocations in taskpools.



Note: despite all those perceived shortcomings, taskpools should be a 
high-performing threadpool, even compared to threadpools in other languages, 
and especially compared to the one included in Nim's standard library.

While it relies on Nim 1.6's std/tasks, it has a compatibility shim to support 
Nim 1.2 and Nim 1.4 as well.
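
For reference, the std/tasks building block looks like this (Nim 1.6+): 
`toTask` packages a call and its arguments into a self-contained `Task` value 
that can be moved to another thread and executed later with `invoke`:

import std/tasks

proc greet(name: string) =
  echo "hello ", name

# Capture the call and its arguments in a heap-allocated Task.
let t = toTask greet("world")

# A threadpool would run this on a worker thread; here we invoke it inline.
t.invoke()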