RE: [Haskell-cafe] Grouping and SIMD in parallel Haskell (using Nested Data Parallel Haskell ideas in legacy code)

2009-08-18 Thread Zefirov Sergey
> -----Original Message-----
> From: Eugene Kirpichov [mailto:ekirpic...@gmail.com]
> Sent: Tuesday, August 18, 2009 11:28 AM
> To: Zefirov Sergey
> Cc: haskell-cafe@haskell.org
> Subject: Re: [Haskell-cafe] Grouping and SIMD in parallel Haskell
> (using Nested Data Parallel Haskell ideas in legacy code)
 
> Now I finally understood the idea that you were talking about :) I
> didn't quite get it during the MskHUG meeting.
> The idea is brilliant to my mind; I am surprised that none of the
> gurus are answering.
>
> So, if I understood it correctly, you are suggesting to:
>  - make a separate spark queue (implemented by a linked list of
> groupSize-sized arrays) for every thunk type in the program that can
> possibly be the argument of a `par` or `pseq` (which, however, may
> turn out to be almost all thunk types present in the program)

Yep.

And it could be beneficial by itself.

>  - automagically create vectorized versions of these thunks' code

If we can.

Looks like we can. ;)

>  - spawn sparks by appending them to the corresponding queue (to a
> non-locked array node?)

Spark queues can be lock-free if they belong to threads; in other words, each
thread has its own spark queue and all threads share the main queue.

If spark queues are shared between threads, they must be locked.

My model suggests that private spark queues slow down execution, but I haven't
paid enough attention to the subject, so my code could be wrong.

Look for yourself:

usePrivateGroups  maxABTaskLength  ticks (fib 15)
0                 0                7568
0                 16               3157
0                 32               2977
1                 0                7753
1                 16               3772
1                 32               4242

All other parameters are: cpuCount 16, taskFetchsPerCycle 1,
thunksPerAddition 1, cyclesPerAddition 1.

Anyway, let's say we have several processors and they all share all the
queues. When one processor finds that some queue is locked, it immediately
proceeds to another queue, which is less likely to be locked. A sketch of
this "skip if locked" fetch is below.
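
For illustration, here is a minimal sketch of that fetch in plain Haskell (the
names and the MVar encoding are mine, not the model's or GHC's actual
machinery; an empty MVar stands for a locked queue):

import Control.Concurrent.MVar

-- Try each queue in turn and take work from the first one that is not
-- locked; tryTakeMVar returns immediately instead of blocking.
fetchFromAny :: [MVar [t]] -> IO (Maybe t)
fetchFromAny []     = return Nothing
fetchFromAny (q:qs) = do
  m <- tryTakeMVar q                 -- Nothing: someone else holds the lock
  case m of
    Just (x:xs) -> putMVar q xs >> return (Just x)
    Just []     -> putMVar q [] >> fetchFromAny qs  -- unlocked but empty
    Nothing     -> fetchFromAny qs                  -- locked: skip it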

>  - evaluate sparks in a vectorized fashion by invoking the vectorized
> code over groups

Again, if we can.


> The problems I see here are:
>  - not all thunks' code can be vectorized automatically

I tried to find a counterexample and failed. But I haven't tried very
hard. ;)

I noticed that the first argument of par should be considered strict, so we
can demand that fib receive an Int# instead of a boxed (and delayed) Int.

Operations on [:Int#:] ([:Float#:], [:Char#:]) should be very efficient.

An argument could be delayed (lazy) or of an algebraic type (like Maybe). In
that case we will have forcing of a computation or pattern matching and,
therefore, conditional execution.

Conditional execution can be expressed with parallel SIMD operations; an
example is given in Tim Sweeney's presentation, near the end
(http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf).

The version there isn't the best possible, but it works.
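
To make that concrete, here is a minimal sketch of the same predication trick
in plain Haskell, with Data.Vector standing in for [: :] arrays (the names and
the per-lane branches are mine, purely for illustration): both branches are
computed for every element and a mask blends the results, so no per-element
branching remains.

import qualified Data.Vector.Unboxed as V

-- Flattened conditional: compute both branches for every lane, then blend
-- by mask (cf. an SSE/AVX select instruction).
selectByMask :: V.Vector Int -> V.Vector Int
selectByMask xs = V.zipWith3 pick masks thens elses
  where
    masks = V.map (<= 1) xs            -- the predicate, for all lanes
    thens = V.map (const 1) xs         -- the "then" branch, for all lanes
    elses = V.map (\n -> n * 2) xs     -- an example "else" branch, all lanes
    pick m t e = if m then t else e    -- per-lane blend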

>  - how should we represent the sparks in the queue for a certain thunk
> type? It's clear that it is no longer needed to put the whole closure
> into the spark queue, since the closure code is already known and
> fixed for each queue. For efficient vectorization, the queue probably
> should consist of closure data arrays.

I think we could borrow the closure representation from NDP Haskell. ;)
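
For concreteness, a minimal sketch (my names, not NDPH's actual
representation) of what such a per-closure-type queue could store: the code
pointer is kept once, and only the free variables of the pending closures are
stored, as a flat unboxed array (struct-of-arrays):

import qualified Data.Vector.Unboxed as V

-- One queue per thunk type: the shared code is stored once, the per-spark
-- free variables live in a flat unboxed array, ready for vectorization.
data SparkQueue arg res = SparkQueue
  { sqCode :: arg -> res     -- the one evaluation function for this queue
  , sqArgs :: V.Vector arg   -- free variables of all pending sparks
  }

-- Evaluate the whole queue in one pass; this is the loop a SIMD backend
-- could vectorize.
runQueue :: (V.Unbox arg, V.Unbox res) => SparkQueue arg res -> V.Vector res
runQueue (SparkQueue code args) = V.map code args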

>  - I do not immediately see an efficient (lockless) way of picking a
> nonempty spark queue; however, I am sure there must be such a way, and
> even if there isn't, the inefficiency may well be outweighed by the
> efficiency gain from vectorization

With separate spark queues we avoid frequent locking of the main spark queue:
CPUs get more work per lock acquisition. I think that is good in itself;
vectorization can only add to it.

Look at the table:

maxABTaskLength  ticks (fib 15)
0                7568
1                5651
2                4656
4                3663
8                3263
16               3157
32               2977
64               2931

(other model parameters: cpuCount 16, usePrivateGroups 0, maxABTaskLength
variable, taskFetchsPerCycle 1, thunksPerAddition 1, cyclesPerAddition 1)

The effect of vectorization in my model is negligible: thunksPerAddition 2
results in 2894 ticks, and beyond that the time stops decreasing. Increasing
taskFetchsPerCycle also does not provide any noticeable speedup.

I cannot promise anything, but I will try to implement my idea, because I
already see that it is a good one. ;)

PS
I am always fascinated by how amenable Haskell is to RTS changes. Add a
complication underneath once and free yourself from complications on the
surface forever. ;)


[Haskell-cafe] Looking up functions inQ monad (Template Haskell).

2009-08-18 Thread Zefirov Sergey
Why isn't it possible to look up and call user-defined functions while
performing Template Haskell actions?

This is just what Template Haskell itself does with 'f' in $(f ...): 'f' gets
looked up and called. Yet a user of the Q monad cannot do the same.
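
To make the asymmetry concrete, a minimal sketch (module and function names
are mine): the splice mechanism happily looks up and runs mkInt at compile
time, but inside a Q action there is no primitive that takes a looked-up Name
and invokes the value behind it.

-- Helpers.hs: an ordinary user-defined code generator.
module Helpers where
import Language.Haskell.TH (Q, Exp, litE, integerL)

mkInt :: Int -> Q Exp
mkInt n = litE (integerL (fromIntegral n))

-- Main.hs: TH itself looks 'mkInt' up and calls it, and this works...
{-# LANGUAGE TemplateHaskell #-}
import Helpers (mkInt)

five :: Int
five = $(mkInt 5)   -- ...but inside a Q action we can only reify a Name to
                    -- inspect its declaration, not call the function it names.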

A colleague of mine, a Lisp user, says that almost everything is already in
place; it's just not enabled.

Is there any problem we do not understand?


[Haskell-cafe] Grouping and SIMD in parallel Haskell (using Nested Data Parallel Haskell ideas in legacy code)

2009-08-17 Thread Zefirov Sergey
I haven't had enough time to polish a paper about the subject, so I decided to 
post my results here, in Haskell Café.

When Simon Peyton-Jones was in Moscow about a month ago I made a bold statement 
that Parallel Haskell programs, expressed with the help of par and pseq, can be 
transformed into Nested Data Parallel Haskell.

I wrote a simple model of what happens if we group the arguments of par and
pseq into queues based on parallel arrays from NDPH. We could then evaluate
all thunks from those queues in parallel using SIMD (or just get higher ILP).
We also do not have to lock the main spark queue; we lock only the queue we
put the argument into. The latter turned out to be beneficial in itself, at
least in my simple model.

Let us look at the famous parallel fib function:

fib n
  | n <= 1    = 1
  | otherwise = a `par` b `pseq` a+b
  where
    a = fib (n-1)
    b = fib (n-2)

When we evaluate fib we put a spark for a into the spark queue and proceed to
evaluate b and then a+b. When we put a into the spark queue, we lock the
queue, modify it and release it. There is a high probability that threads on
different cores will compete for the queue lock and some of them (most of
them) will waste time waiting for its release.

If we group the a's into a separate queue and put into the main spark queue a
single spark that evaluates a complete group of a's at once, we get less
wasted time. A library-level sketch of this grouping follows.
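
For illustration only (all names are mine; the real proposal is an RTS change,
not user code): arguments for one thunk type accumulate in a buffer, and the
main queue receives a single spark per full group rather than one per element.
In the real design the group's results would also be written back to the
waiting parents; here the spark is purely speculative.

import Data.IORef
import Control.Parallel (par)
import System.IO.Unsafe (unsafePerformIO)

groupSize :: Int
groupSize = 16

{-# NOINLINE aQueue #-}
aQueue :: IORef [Int]      -- pending arguments for the "fib (n-1)" thunks
aQueue = unsafePerformIO (newIORef [])

-- Buffer one argument; spark the whole group when the buffer fills up.
sparkGrouped :: (Int -> Int) -> Int -> IO ()
sparkGrouped f x = do
  grp <- atomicModifyIORef aQueue $ \xs ->
    let xs' = x : xs
    in if length xs' >= groupSize then ([], xs') else (xs', [])
  case grp of
    [] -> return ()                  -- group not full yet
    g  -> let s = sum (map f g)      -- forcing s evaluates the whole group
          in s `par` return ()       -- one spark for groupSize thunks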

This will work even for a single CPU. Below is a run of (fib 15) on my model
with cpuCount 1, 4 and 16, and an a/b group length of 0 (no grouping) and 16:

cpuCount  groupLength=0  groupLength=16
          modelTicks     modelTicks
1         34535          27189
4         12178          7472
16        7568           3157
speedup   2.19 times     3.11 times
(ticks1/ticks16)

I think the results speak for themselves. The idea of grouping `par`
arguments could be viable.

I should note that I made several simplifications when I wrote the model. One
of them is that all evaluations are put into a queue: in (a `par` b `pseq`
a+b), a is put into a_queue, b into b_queue and (a+b) into the main queue.
Each evaluated spark updates its parent, the spark that waits for it.

Also, the main loop of a single CPU changed from the simple form (read the
main spark queue and execute whatever you get) into a series of attempts with
fallback on failure (a sketch follows the list):
- first read the main queue and execute a spark if that succeeds,
- else read the current a_queue and execute all sparks there if that succeeds,
- else read the current b_queue and execute all sparks there if that succeeds,
- else go back to the start of the loop.
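
A minimal sketch of that loop, with queues modeled as MVar-guarded lists
(tryTakeMVar stands in for "skip if locked"; all names are mine, not the
model's actual API):

import Control.Concurrent.MVar

type Queue a = MVar [a]

-- Take one item without blocking; Nothing if the queue is locked or empty.
tryFetchOne :: Queue a -> IO (Maybe a)
tryFetchOne q = do
  m <- tryTakeMVar q
  case m of
    Nothing     -> return Nothing
    Just []     -> putMVar q [] >> return Nothing
    Just (x:xs) -> putMVar q xs >> return (Just x)

-- Drain a whole group queue at once without blocking.
tryFetchAll :: Queue a -> IO [a]
tryFetchAll q = do
  m <- tryTakeMVar q
  case m of
    Nothing -> return []
    Just xs -> putMVar q [] >> return xs

-- One CPU's loop: main queue first, then a_queue, then b_queue, then retry.
cpuLoop :: Queue (IO ()) -> Queue Int -> Queue Int -> (Int -> IO ()) -> IO ()
cpuLoop mainQ aQ bQ evalElem = do
  m <- tryFetchOne mainQ
  case m of
    Just spark -> spark                    -- 1. a spark from the main queue
    Nothing    -> do
      as <- tryFetchAll aQ                 -- 2. else drain the a_queue
      if not (null as)
        then mapM_ evalElem as
        else tryFetchAll bQ >>= mapM_ evalElem  -- 3. else the b_queue
  cpuLoop mainQ aQ bQ evalElem             -- 4. start over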

The new (transformed) code for our fib is below:

-- | Create a new queue based on a parallel array. It holds a parallel array
-- with the current arguments and a function that performs the computation
-- in the RTS monad (the evaluation function).
newQueueParArray :: (x -> RTS ()) -> RTSRef ([: x :], x -> RTS ())

a_queue = unsafePerformIO $ newQueueParArray (\x -> fib (x-1)) -- RTSRef (Int, Int -> Int)
b_queue = unsafePerformIO $ newQueueParArray (\x -> fib (x-2)) -- RTSRef (Int, Int -> Int)

fib n caller
  | n <= 1    = 1
  | otherwise = unsafePerformIO $ do
      ab <- addToMainQueue (defer (+) caller)
      a' <- addToParArrQueue a_queue n ab
      b' <- addToParArrQueue b_queue n ab
      addToMainQueue ab -- add a spark to check the evaluation status of
                        -- a' and b', compute a+b and update the caller

That transformation cannot be done at the source level using the usual type
(class/families) hackery. It could be done, though, using core-to-core
transformations.

It is clear that several values of the same type (Int for fib) and a single
function that performs operations over them lead naturally to SIMD execution.

I made some provisions to exploit SIMD and ILP in my model: it can load more
than a single task's information per cycle and add several values per cycle.

This also speeds execution up, but not as radically (by about 3%). A tiny
illustration of the ILP effect follows.
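
A purely illustrative sketch (not the model's code) of what "several values
per cycle" means: a 4-way unrolled sum keeps four independent accumulators,
so the additions do not form one serial dependency chain and a superscalar
CPU can issue several of them per cycle.

-- 4-way unrolled sum: four independent accumulators expose instruction-level
-- parallelism (in a real program the accumulators would need to be strict).
sum4 :: [Int] -> Int
sum4 = go 0 0 0 0
  where
    go a b c d (x1:x2:x3:x4:rest) = go (a+x1) (b+x2) (c+x3) (d+x4) rest
    go a b c d rest               = a + b + c + d + sum rest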

The source code for the model is here:
http://82.146.47.211/attachment/wiki/ndpph/ndp-ph.hs (it is the MskHUG.ru
domain; we have temporary problems with DNS). It is short on comments, but I
tried to use understandable function names.

Compile it with 'ghc -o ndp-ph --make -O2 ndp-ph' and run with a command line
like the following:

ndp-ph.exe cpuCount 1 usePrivateGroups 0 maxABTaskLength 64 taskFetchsPerCycle 1 thunksPerAddition 8 cyclesPerAddition 1

I decided to create a model of execution instead of modifying an existing
implementation because I did not have enough time. I wrote it over evenings
and a weekend, so it is simple, it is crude, and it does the job pretty well.


[Haskell-cafe] gtk2hs, TreeView and CellRenderer

2009-05-28 Thread Zefirov Sergey
As far as I can tell, there is no way to create a CellRenderer that combines
icons with text (I think this is because of the mismatch between Haskell type
classes and the Gtk inheritance hierarchy).

Is it a good approach to use CellRendererPixbuf and render to pixbufs using
Cairo? I think my TreeView hierarchies won't be very big, around 1000 items
for a while.