Re: [Haskell-cafe] testing par with simple program

2009-08-21 Thread Don Stewart
paolo.veronelli:
 Hi, reading a previous thread I got interested.
 I simplified the example pointed by dons in
 
 import Control.Parallel
  
 main = a `par` b  `pseq` print (a + b )
 where
 a = ack 3 11
 b = ack 3 11

 ack 0 n = n+1
 ack m 0 = ack (m-1) 1
 ack m n = ack (m-1) (ack m (n-1))
 
 compiled with
 ghc --make prova  -O2 -threaded
 
 timings
 paol...@paolino-casa:~$ time ./prova +RTS -N1
 32762
 
 real0m7.031s
 user0m6.304s
 sys0m0.004s
 paol...@paolino-casa:~$ time ./prova +RTS -N2
 32762
 
 real0m6.997s
 user0m6.728s
 sys0m0.020s
 paol...@paolino-casa:~$
 
 without optimizations it gets worse
 
 paol...@paolino-casa:~$ time ./prova +RTS -N1
 32762
 
 real1m20.706s
 user1m18.197s
 sys0m0.104s
 paol...@paolino-casa:~$ time ./prova +RTS -N2
 32762
 
 real1m38.927s
 user1m45.039s
 sys0m0.536s
 paol...@paolino-casa:~$
 
 staring at the resource usage graph I can see it does use 2 cores when told to
 do it, but with -N1 the used cpu goes 100% and with -N2 they both run just 
 over
 50%
 
 thanks for comments


Firstly, a and b are identical, so GHC commons them up. The compiler
transforms it into:

a `par` a `seq` print (a + a)

So you essentially fork a spark to evaluate 'a', and then have the main
thread also evaluate 'a' again. One of them wins, then you add the
result to itself. The runtime may choose not to convert your first spark
into a thread.

Running with a 2009 GHC head snapshot, we can see with +RTS -sstderr

  SPARKS: 1 (0 converted, 0 pruned)

That indeed, it doesn't convert your `par` into a real thread.

While, for example, the helloworld on the wiki:

http://haskell.org/haskellwiki/Haskell_in_5_steps

Converts 2 sparks to 2 theads:

SPARKS: 2 (2 converted, 0 pruned)
./B +RTS -threaded -N2 -sstderr  2.13s user 0.04s system 137% cpu 1.570 
total 

-- Don

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] testing par with simple program

2009-08-21 Thread Paolino
A better test program

import Control.Parallel

main = a `par` b  `pseq` print (a + b )
where
a = ack 3 11
b = fib 39

ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

running it , these are the results

paol...@paolino-casa:~$ ghc --make prova   -threaded
[1 of 1] Compiling Main ( prova.hs, prova.o )
Linking prova ...
paol...@paolino-casa:~$ time ./prova +RTS -N1
63262367

real1m17.485s
user1m16.473s
sys0m0.392s
paol...@paolino-casa:~$ time ./prova +RTS -N2
63262367

real1m20.186s
user1m31.554s
sys0m0.600s
paol...@paolino-casa:~$ touch prova.hs
paol...@paolino-casa:~$ ghc --make prova -O2  -threaded
[1 of 1] Compiling Main ( prova.hs, prova.o )
Linking prova ...
paol...@paolino-casa:~$ time ./prova +RTS -N1
63262367

real0m17.652s
user0m15.277s
sys0m0.108s
paol...@paolino-casa:~$ time ./prova +RTS -N2
63262367

real0m13.650s
user0m15.121s
sys0m0.188s

From the resource graph and the timings it is clear that the program is not
able to use all the 2 cores powers, considering computing 'a' alone is about
7 seconds and 'b' alone 9.
What is retaining the cpu's to run full power ?


paolino

2009/8/21 Don Stewart d...@galois.com

 paolo.veronelli:
  Hi, reading a previous thread I got interested.
  I simplified the example pointed by dons in
 
  import Control.Parallel
 
  main = a `par` b  `pseq` print (a + b )
  where
  a = ack 3 11
  b = ack 3 11
 
  ack 0 n = n+1
  ack m 0 = ack (m-1) 1
  ack m n = ack (m-1) (ack m (n-1))
 
  compiled with
  ghc --make prova  -O2 -threaded
 
  timings
  paol...@paolino-casa:~$ time ./prova +RTS -N1
  32762
 
  real0m7.031s
  user0m6.304s
  sys0m0.004s
  paol...@paolino-casa:~$ time ./prova +RTS -N2
  32762
 
  real0m6.997s
  user0m6.728s
  sys0m0.020s
  paol...@paolino-casa:~$
 
  without optimizations it gets worse
 
  paol...@paolino-casa:~$ time ./prova +RTS -N1
  32762
 
  real1m20.706s
  user1m18.197s
  sys0m0.104s
  paol...@paolino-casa:~$ time ./prova +RTS -N2
  32762
 
  real1m38.927s
  user1m45.039s
  sys0m0.536s
  paol...@paolino-casa:~$
 
  staring at the resource usage graph I can see it does use 2 cores when
 told to
  do it, but with -N1 the used cpu goes 100% and with -N2 they both run
 just over
  50%
 
  thanks for comments


 Firstly, a and b are identical, so GHC commons them up. The compiler
 transforms it into:

a `par` a `seq` print (a + a)

 So you essentially fork a spark to evaluate 'a', and then have the main
 thread also evaluate 'a' again. One of them wins, then you add the
 result to itself. The runtime may choose not to convert your first spark
 into a thread.

 Running with a 2009 GHC head snapshot, we can see with +RTS -sstderr

  SPARKS: 1 (0 converted, 0 pruned)

 That indeed, it doesn't convert your `par` into a real thread.

 While, for example, the helloworld on the wiki:

http://haskell.org/haskellwiki/Haskell_in_5_steps

 Converts 2 sparks to 2 theads:

SPARKS: 2 (2 converted, 0 pruned)
./B +RTS -threaded -N2 -sstderr  2.13s user 0.04s system 137% cpu 1.570
 total

 -- Don


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] testing par with simple program

2009-08-21 Thread Don Stewart
paolo.veronelli:
 A better test program
 
 import Control.Parallel
  
 main = a `par` b  `pseq` print (a + b )
 where
 a = ack 3 11
 b = fib 39

 ack 0 n = n+1
 ack m 0 = ack (m-1) 1
 ack m n = ack (m-1) (ack m (n-1))
 
 fib 0 = 0
 fib 1 = 1
 fib n = fib (n-1) + fib (n-2)
 
 running it , these are the results
 
 From the resource graph and the timings it is clear that the program is not
 able to use all the 2 cores powers, considering computing 'a' alone is about 7
 seconds and 'b' alone 9.
 What is retaining the cpu's to run full power ?
 

Some advice on analysing parallel programs:

* Read Simon Marlow's excellent paper: 
Runtime Support for Multicore Haskell

http://ghcmutterings.wordpress.com/2009/03/03/new-paper-runtime-support-for-multicore-haskell/

* Install GHC Head from 2009, it includes spark profiling.

* Never bother measuring without using -O or -O2 -- unoptimized code
  is just not useful to think about.

* Compile your program:

$ ghc-6.11.20090228 -O2 --make -threaded A.hs
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...

* Run it with +RTS -sstderr

$ ./A +RTS -N2 -sstderr
./A +RTS -N2 -sstderr 
63262367

...

  Generation 0: 12021 collections, 0 parallel,  1.33s,  1.42s 
elapsed
  Generation 1: 2 collections, 1 parallel,  0.00s,  0.00s 
elapsed

  Parallel GC work balance: 1.03 (426 / 414, ideal 2)

  SPARKS: 1 (1 converted, 0 pruned)

  Productivity  90.7% of total user, 126.4% of total elapsed

And we see 3 things:

* It did use the parallel GC : good!
* It did convert 1 spark to a thread: good!
* Productivity was greater than 100%: good!

So this looks fine!


So I suspect you're falling into the ghc 6.10.x series bug where  3 sparks 
would not be converted.
Try with either 3 sparks, or using GHC 6.11.x series.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe