Re: [Haskell-cafe] testing par with simple program
paolo.veronelli:

> Hi, reading a previous thread I got interested. I simplified the example pointed to by dons:
>
>     import Control.Parallel
>
>     main = a `par` b `pseq` print (a + b)
>       where
>         a = ack 3 11
>         b = ack 3 11
>
>     ack 0 n = n + 1
>     ack m 0 = ack (m - 1) 1
>     ack m n = ack (m - 1) (ack m (n - 1))
>
> compiled with:
>
>     ghc --make prova -O2 -threaded
>
> Timings:
>
>     paol...@paolino-casa:~$ time ./prova +RTS -N1
>     32762
>     real    0m7.031s
>     user    0m6.304s
>     sys     0m0.004s
>
>     paol...@paolino-casa:~$ time ./prova +RTS -N2
>     32762
>     real    0m6.997s
>     user    0m6.728s
>     sys     0m0.020s
>
> Without optimizations it gets worse:
>
>     paol...@paolino-casa:~$ time ./prova +RTS -N1
>     32762
>     real    1m20.706s
>     user    1m18.197s
>     sys     0m0.104s
>
>     paol...@paolino-casa:~$ time ./prova +RTS -N2
>     32762
>     real    1m38.927s
>     user    1m45.039s
>     sys     0m0.536s
>
> Staring at the resource usage graph I can see that the program does use 2 cores when told to, but with -N1 the CPU in use runs at 100%, while with -N2 both cores run at just over 50%. Thanks for any comments.

Firstly, 'a' and 'b' are identical, so GHC commons them up. The compiler transforms the program into:

    a `par` a `pseq` print (a + a)

So you essentially fork a spark to evaluate 'a', and then have the main thread also evaluate 'a'. One of them wins, and then you add the result to itself. The runtime may choose not to convert your first spark into a thread. Running with a 2009 GHC HEAD snapshot, +RTS -sstderr reports:

    SPARKS: 1 (0 converted, 0 pruned)

That is, it indeed doesn't convert your `par` into a real thread. By contrast, the hello-world example on the wiki:

    http://haskell.org/haskellwiki/Haskell_in_5_steps

converts both of its sparks into threads:

    SPARKS: 2 (2 converted, 0 pruned)

    ./B +RTS -threaded -N2 -sstderr  2.13s user 0.04s system 137% cpu 1.570 total

-- Don

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
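Since the sharing is what defeats the parallelism here, one quick check is to give the spark genuinely different work so GHC cannot common the two expressions up. A minimal sketch (the argument values below are arbitrary choices for illustration, not from the thread):

```haskell
import Control.Parallel (par, pseq)

-- 'a' and 'b' do genuinely different work, so GHC cannot common
-- them up and the spark for 'a' has something real to compute.
main :: IO ()
main = a `par` (b `pseq` print (a + b))
  where
    a = ack 3 10  -- sparked for parallel evaluation
    b = ack 3 11  -- evaluated by the main thread

ack :: Integer -> Integer -> Integer
ack 0 n = n + 1
ack m 0 = ack (m - 1) 1
ack m n = ack (m - 1) (ack m (n - 1))
```

Alternatively, compiling with GHC's -fno-cse flag disables common-subexpression elimination module-wide, which also keeps the two identical expressions separate.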
Re: [Haskell-cafe] testing par with simple program
A better test program:

    import Control.Parallel

    main = a `par` b `pseq` print (a + b)
      where
        a = ack 3 11
        b = fib 39

    ack 0 n = n + 1
    ack m 0 = ack (m - 1) 1
    ack m n = ack (m - 1) (ack m (n - 1))

    fib 0 = 0
    fib 1 = 1
    fib n = fib (n - 1) + fib (n - 2)

Running it, these are the results:

    paol...@paolino-casa:~$ ghc --make prova -threaded
    [1 of 1] Compiling Main             ( prova.hs, prova.o )
    Linking prova ...
    paol...@paolino-casa:~$ time ./prova +RTS -N1
    63262367
    real    1m17.485s
    user    1m16.473s
    sys     0m0.392s
    paol...@paolino-casa:~$ time ./prova +RTS -N2
    63262367
    real    1m20.186s
    user    1m31.554s
    sys     0m0.600s
    paol...@paolino-casa:~$ touch prova.hs
    paol...@paolino-casa:~$ ghc --make prova -O2 -threaded
    [1 of 1] Compiling Main             ( prova.hs, prova.o )
    Linking prova ...
    paol...@paolino-casa:~$ time ./prova +RTS -N1
    63262367
    real    0m17.652s
    user    0m15.277s
    sys     0m0.108s
    paol...@paolino-casa:~$ time ./prova +RTS -N2
    63262367
    real    0m13.650s
    user    0m15.121s
    sys     0m0.188s

From the resource graph and the timings it is clear that the program is not able to use the full power of both cores, considering that computing 'a' alone takes about 7 seconds and 'b' alone about 9. What is keeping the CPUs from running at full power?

paolino

2009/8/21 Don Stewart d...@galois.com:

> paolo.veronelli: Hi, reading a previous thread I got interested.
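For context on how much speedup is even available here: with two unequal tasks, the two-core wall time cannot beat the cost of the longer task. A back-of-the-envelope sketch, assuming the single-task costs quoted in the message (roughly 7 s for 'a' and 9 s for 'b'):

```haskell
-- Rough upper bound on two-core speedup for two unequal tasks,
-- using the single-core costs quoted above (assumed: a ~ 7 s, b ~ 9 s).
main :: IO ()
main = do
  let tA = 7.0 :: Double
      tB = 9.0
      sequentialTime = tA + tB    -- one core does both: 16 s
      parallelTime   = max tA tB  -- two cores, one task each: 9 s
  putStrLn $ "best speedup on 2 cores: " ++ show (sequentialTime / parallelTime)
```

Even with perfect scheduling, 16 s / 9 s gives a ceiling of about 1.78x, so the observed 17.6 s to 13.6 s improvement (about 1.29x) is below the bound, but a full 2x was never available with this split of work.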
Re: [Haskell-cafe] testing par with simple program
paolo.veronelli:

> A better test program:
>
>     import Control.Parallel
>
>     main = a `par` b `pseq` print (a + b)
>       where
>         a = ack 3 11
>         b = fib 39
>
>     ack 0 n = n + 1
>     ack m 0 = ack (m - 1) 1
>     ack m n = ack (m - 1) (ack m (n - 1))
>
>     fib 0 = 0
>     fib 1 = 1
>     fib n = fib (n - 1) + fib (n - 2)
>
> From the resource graph and the timings it is clear that the program is not able to use the full power of both cores, considering that computing 'a' alone takes about 7 seconds and 'b' alone about 9. What is keeping the CPUs from running at full power?

Some advice on analysing parallel programs:

* Read Simon Marlow's excellent paper, "Runtime Support for Multicore Haskell":
  http://ghcmutterings.wordpress.com/2009/03/03/new-paper-runtime-support-for-multicore-haskell/

* Install GHC HEAD (2009); it includes spark profiling.

* Never bother measuring without -O or -O2: unoptimized code is just not useful to think about.

* Compile your program:

      $ ghc-6.11.20090228 -O2 --make -threaded A.hs
      [1 of 1] Compiling Main             ( A.hs, A.o )
      Linking A ...

* Run it with +RTS -sstderr:

      $ ./A +RTS -N2 -sstderr
      63262367
      ...
      Generation 0: 12021 collections, 0 parallel, 1.33s, 1.42s elapsed
      Generation 1:     2 collections, 1 parallel, 0.00s, 0.00s elapsed

      Parallel GC work balance: 1.03 (426 / 414, ideal 2)

      SPARKS: 1 (1 converted, 0 pruned)

      Productivity  90.7% of total user, 126.4% of total elapsed

And we see three things:

* It did use the parallel GC: good!
* It did convert 1 spark to a thread: good!
* Productivity was greater than 100% of total elapsed: good!

So this looks fine! I suspect you're falling into the GHC 6.10.x series bug where sparks would not be converted. Try with either 3 sparks, or using the GHC 6.11.x series.

-- Don
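The three-spark workaround Don mentions might look like the sketch below (not code from the thread; the third task 'c' and its size are arbitrary choices for illustration):

```haskell
import Control.Parallel (par)

-- Three-spark variant of the test program above, as a workaround for
-- the GHC 6.10.x spark-conversion bug: 'a', 'b' and 'c' are all
-- sparked, and the main thread then demands the sum.
main :: IO ()
main = a `par` b `par` c `par` print (a + b + c)
  where
    a = ack 3 11
    b = fib 39
    c = fib 38  -- extra task, chosen arbitrarily

ack :: Integer -> Integer -> Integer
ack 0 n = n + 1
ack m 0 = ack (m - 1) 1
ack m n = ack (m - 1) (ack m (n - 1))

fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)
```

Running with +RTS -N2 -sstderr as above should then report three sparks created, and the converted/pruned counts will show whether the runtime actually turned them into threads.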