Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-09 Thread Chris Yuen
Hi all,

Thanks Bryan, reading your clean code was good for my Haskell health :)

I took your code and did some more research. I think I have found the
answer. I have written an extensive analysis in my blog post
http://cfc.kizzx2.com/index.php/in-search-of-performance-in-haskell/
(comments are very much welcome, btw :)

Here are summaries of key points:

- I was using GHC 32-bit. Int is 32-bit there, so I needed Int64. It turns
out 64-bit operations in 32-bit programs are just darn slow. Maybe it's a
Windows problem. On Linux 64 bit GHC Int is 64 bit so everything just works.
Changing Int64 to Int liberates me from many `fromIntegral` which saved 20%
- Changing `divMod` to `quotRem` saved another 20% (a small sketch of this
change follows after this list)
- Using `Data.Vector.Unboxed` and `unsafeIndex` saved another 15% or so
- Moving the length arrays to the `where` clause in `solve` with bang
patterns on them saved some more.
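
To make the `divMod` vs `quotRem` point concrete, here is a minimal sketch
(the `sumDigits*` functions are made up for illustration, not code from my
program):

-- divMod rounds toward negative infinity and pays for extra sign
-- fix-up work; quotRem truncates toward zero, which is all we need
-- for non-negative numbers and is closer to the machine division.
sumDigitsDivMod, sumDigitsQuotRem :: Int -> Int
sumDigitsDivMod 0 = 0
sumDigitsDivMod n = let (q, r) = n `divMod` 10 in r + sumDigitsDivMod q

sumDigitsQuotRem 0 = 0
sumDigitsQuotRem n = let (q, r) = n `quotRem` 10 in r + sumDigitsQuotRem q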

This was a great learning experience! Now I have more questions :P

1. Why are bangs needed on the length arrays?

If I remove them from below, performance drops 10%. I thought `unsafeIndex`
is straight in both arguments, no?

wordLength i = go i
  where
    go n
      | n < 10      = lengthOnes !! n
      | n < 20      = lengthTeens !! (n-10)
      | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
      | n < 1000    = (lengthOnes !! (n // 100)) + 7 + go (n % 100)
      | n < 1000000 = go (n // 1000) + 8 + go (n % 1000)
      | otherwise   = go (n // 1000000) + 7 + go (n % 1000000)
    !lengthOnes  = lengthVec ones
    !lengthTens  = lengthVec tens
    !lengthTeens = lengthVec teens

2. Why does the single-element worker wrapper pattern (`go` functions)
increase performance?

If we change wordLength to

wordLength n
  | n < 10      = lengthOnes !! n
  | n < 20      = lengthTeens !! (n-10)
  | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
  | n < 1000    = (lengthOnes !! (n // 100)) + 7 + wordLength (n % 100)
  | n < 1000000 = wordLength (n // 1000) + 8 + wordLength (n % 1000)
  | otherwise   = wordLength (n // 1000000) + 7 + wordLength (n % 1000000)
  where
    !lengthOnes  = lengthVec ones
    !lengthTens  = lengthVec tens
    !lengthTeens = lengthVec teens

The performance drops by another 10%. This really surprised me. `go i`
seemed obvious to me and I don't understand how it could make any
difference. The full source code is available to GHC, so it shouldn't be
related to a call-by-pointer problem? If this is the case, shouldn't we
always wrap a go function around **any** recursive function?

Thanks!

Chris

On Tue, Aug 9, 2011 at 9:09 AM, Reiner Pope reiner.p...@gmail.com wrote:

 On 9 August 2011 10:06, Bryan O'Sullivan b...@serpentine.com wrote:

 On Mon, Aug 8, 2011 at 9:24 AM, Chris Yuen kizzx2+hask...@gmail.com wrote:


 For reference I have asked the same question on StackOverflow. One person
 suggested that the reason might be that Int64 on Windows is broken (
 http://stackoverflow.com/questions/6970904/analyzing-slow-performance-of-a-haskell-program/6976448#6976448
 ).


 No, they're barking up the wrong tree.

 I've put an idiomatic Haskell translation of your C++ algorithm at
 https://gist.github.com/1133048#file_wordy.hs

 (I've also included a copy of your original C++, with a bug fixed, in the
 same gist.)

 As you can see, the two are almost identical. Not surprisingly, each one
 spends the bulk of its time computing word lengths.

 GHC simply doesn't do a great job of compiling fairly tight code like
 this. gcc generates about 100 lines of assembly that's mostly easy to follow
 (except for some bit-twiddling tricks to avoid div instructions). Although
 the Core it generates looks fine, GHC spends quite a bit of time in its
 generated assembly on what looks to me like STG housekeeping (it spends only
 0.3% of its time in the garbage collector, because it doesn't allocate
 memory). The overall result is that the Haskell code runs about 5x more
 slowly than the C++ code.


 GHC generating bad assembly suggests trying the llvm codegen (see
 http://donsbot.wordpress.com/2010/02/21/smoking-fast-haskell-code-using-ghcs-new-llvm-codegen/).
 Compiling Bryan's code with

 $ ghc -O2 -fllvm Wordy.hs

 it now runs only 2x slower than the C++ code.

 Reiner



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-09 Thread Johan Tibell
Hi Chris,

On Tue, Aug 9, 2011 at 12:47 PM, Chris Yuen kizzx2+hask...@gmail.com wrote:
 1. Why are bangs needed on the length arrays?

 If I remove them from below, performance drops 10%. I thought `unsafeIndex`
 is straight in both arguments, no?

 wordLength i = go i
   where
     go n
       | n < 10      = lengthOnes !! n
       | n < 20      = lengthTeens !! (n-10)
       | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
       | n < 1000    = (lengthOnes !! (n // 100)) + 7 + go (n % 100)
       | n < 1000000 = go (n // 1000) + 8 + go (n % 1000)
       | otherwise   = go (n // 1000000) + 7 + go (n % 1000000)
     !lengthOnes  = lengthVec ones
     !lengthTens  = lengthVec tens
     !lengthTeens = lengthVec teens

(It's strict, not straight.)

The different lengths are not used in all branches, and since Haskell
is a lazy (or to be pedantic: non-strict) language we cannot compute
them before knowing which branch will be evaluated. For example, given
that we have

ones = ...
tens = error "Boom!"

test = wordLength 0

evaluating 'test' should not cause an exception to be raised, as the
first (n < 10) branch is taken, but it would if lengthTens were strict.

Delaying the evaluation has some costs, namely allocating a thunk for
e.g. `lengthVec ones` and later evaluating that thunk. By making the
lengths strict we can evaluate them earlier and avoid some allocation
and forcing of thunks.

 2. Why does the single-element worker wrapper pattern (`go` functions)
 increase performance?

 If we change wordLength to

 wordLength n
   | n < 10      = lengthOnes !! n
   | n < 20      = lengthTeens !! (n-10)
   | n < 100     = (lengthTens !! (n // 10)) + (lengthOnes !! (n % 10))
   | n < 1000    = (lengthOnes !! (n // 100)) + 7 + wordLength (n % 100)
   | n < 1000000 = wordLength (n // 1000) + 8 + wordLength (n % 1000)
   | otherwise   = wordLength (n // 1000000) + 7 + wordLength (n % 1000000)
   where
     !lengthOnes  = lengthVec ones
     !lengthTens  = lengthVec tens
     !lengthTeens = lengthVec teens

 The performance drops by another 10%. This really surprised me. `go i`
 seemed obvious to me and I don't understand how it could make any
 difference. The full source code is available to GHC, so it shouldn't be
 related to a call-by-pointer problem? If this is the case, shouldn't we
 always wrap a go function around **any** recursive function?

Making wordLength non-recursive lets GHC inline it, which can
sometimes help performance (e.g. if the inlining enables more
optimizations). Inlining does increase code size (and sometimes
allocation if a closure has to be allocated to capture free
variables), so it's not always a good idea.
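
Here is a minimal sketch of the pattern with a made-up function (not
Chris's code):

{-# LANGUAGE BangPatterns #-}

-- The wrapper is non-recursive, so GHC is free to inline it at call
-- sites; the recursion lives in the local worker `go`.
factorial :: Int -> Int
factorial n = go 1 n
  where
    go !acc m
      | m <= 1    = acc
      | otherwise = go (acc * m) (m - 1)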

Cheers,
Johan



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-09 Thread Bryan O'Sullivan
On Tue, Aug 9, 2011 at 9:47 AM, Chris Yuen kizzx2+hask...@gmail.com wrote:


 - I was using GHC 32-bit. Int is 32-bit there, so I needed Int64. It turns
 out 64-bit operations in 32-bit programs are just darn slow. Maybe it's a
 Windows problem.


No, GHC calls out to C for 64-bit integer ops on all 32-bit platforms.


 On Linux 64 bit GHC Int is 64 bit so everything just works. Changing Int64
 to Int liberates me from many `fromIntegral` which saved 20%


Actually, fromIntegral is usually a no-op, so chances are you're seeing the
effect of something else.


 - Changing `divMod` to `quotRem` saved another 20%


It's cheaper again to use quotInt# and remInt# as I did in my code.
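
In sketch form (illustrative only, not the exact code from my gist):

{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), quotInt#, remInt#)

-- The raw primops skip the checks for a zero divisor and for
-- minBound `quot` (-1) that the library versions perform, so the
-- caller must guarantee the arguments are safe.
quotRemUnchecked :: Int -> Int -> (Int, Int)
quotRemUnchecked (I# n) (I# d) = (I# (quotInt# n d), I# (remInt# n d))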


 1. Why are bangs needed on the length arrays?


GHC has to deconstruct the Vector in order to get at the real underlying
array, so that unsafeIndex can perform the actual index into the real
underlying array.

Without bang patterns, the code performs at least one deconstruction on
every iteration through the loop. Each deconstruction has a cost. With the
bang patterns, they're all hoisted out and performed just once.
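
A small sketch of the difference (a made-up example, assuming the
vector package):

{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector.Unboxed as V

-- The bang forces `table` before the loop starts, so the Vector is
-- deconstructed once; without it, each indexing call must force the
-- thunk and deconstruct the Vector again.
digitSum :: Int -> Int
digitSum n0 = go n0 0
  where
    !table = V.fromList [0 .. 9] :: V.Vector Int  -- identity table, for illustration
    go n !acc
      | n == 0    = acc
      | otherwise = go (n `quot` 10) (acc + V.unsafeIndex table (n `rem` 10))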


Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-09 Thread Chris Yuen

 It's cheaper again to use quotInt# and remInt# as I did in my code.


Just want to point out that this actually had surprisingly less of an impact
than I thought. As I did the step-by-step breakdown (you can see my blog
post), this one (changing `quotRem` to `quotInt#` and `remInt#`) yielded
almost negligible differences. Every other change had more impact than this,
including turning the `Data.Array.Unboxed` arrays into lookup functions (like
`lenOnes 0 = 0; lenOnes 1 = 3; lenOnes 2 = 3; ...`)

Chris

On Wed, Aug 10, 2011 at 2:16 AM, Bryan O'Sullivan b...@serpentine.com wrote:

 On Tue, Aug 9, 2011 at 9:47 AM, Chris Yuen kizzx2+hask...@gmail.com wrote:


 - I was using GHC 32-bit. Int is 32-bit there, so I needed Int64. It turns
 out 64-bit operations in 32-bit programs are just darn slow. Maybe it's a
 Windows problem.


 No, GHC calls out to C for 64-bit integer ops on all 32-bit platforms.


 On Linux 64 bit GHC Int is 64 bit so everything just works. Changing Int64
 to Int liberates me from many `fromIntegral` which saved 20%


 Actually, fromIntegral is usually a no-op, so chances are you're seeing the
 effect of something else.


 - Changing `divMod` to `quotRem` saved another 20%


 It's cheaper again to use quotInt# and remInt# as I did in my code.


 1. Why are bangs needed on the length arrays?


 GHC has to deconstruct the Vector in order to get at the real underlying
 array, so that unsafeIndex can perform the actual index into the real
 underlying array.

 Without bang patterns, the code performs at least one deconstruction on
 every iteration through the loop. Each deconstruction has a cost. With the
 bang patterns, they're all hoisted out and performed just once.



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-08 Thread Chris Yuen
Where is the `unsafeAt` function? I can't seem to find it (
http://haskell.org/hoogle/?hoogle=unsafeat).

For reference I have asked the same question on StackOverflow. One person
suggested that the reason might be that Int64 on Windows is broken (
http://stackoverflow.com/questions/6970904/analyzing-slow-performance-of-a-haskell-program/6976448#6976448
).

I tried the same test on Arch Linux x64 (GHC 7.0.3) but it still can't
complete in 3 minutes, whereas a new C++ version I wrote completes in 45
seconds (I didn't want to use Mono for benchmarks). For reference, here is
the C++ implementation: http://ideone.com/vZGhh (again, ironically shorter
than the Haskell version, and it actually looks quite clean).

The profile under x64 Linux is similar to the one posted before -- most
allocations and time spent in wordLength'.

It seems mysterious that such an innocent program is so hard to get right
in Haskell :P

Chris

On Mon, Aug 8, 2011 at 1:40 AM, Eugene Kirpichov ekirpic...@gmail.com wrote:

 What about using unsafe array indexing operations? (i.e. array `unsafeAt`
 index)

 2011/8/7 Chris Yuen kizzx2+hask...@gmail.com:
  Here is an updated version using Data.Array.Unboxed
 http://ideone.com/YXuVL
  And the profile http://hpaste.org/49940
 
  Still taking 5+ minutes...
 
  Chris
 
  On Sun, Aug 7, 2011 at 5:20 PM, Daniel Fischer
  daniel.is.fisc...@googlemail.com wrote:
 
  On Sunday 07 August 2011, 10:52:20, Max Bolingbroke wrote:
   In short I don't see how to get further without changing the algorithm
   or doing some hacks like manual unrolling. Maybe someone else has some
   ideas?
 
  Well, the C# implementation uses arrays for lookup while the Haskell
  version uses list lookups
 
   in (tens !! fromIntegral t) ++ wordify x
 
  and case'd functions
 
  lenTens 0 = 0
  lenTens 1 = 3
  lenTens 2 = 6
  lenTens 3 = 6
  lenTens 4 = 5
  lenTens 5 = 5
  lenTens 6 = 5
  lenTens 7 = 7
  lenTens 8 = 6
  lenTens 9 = 6
 
  wordify is only called once at the end, so that should not have a
  measurable impact, but the lenXXXs might.
  I'm not sure what
 
  CaseLen.$wlenTens :: GHC.Prim.Int# -> GHC.Prim.Int#
  [GblId,
   Arity=1,
   Str=DmdType L,
   Unf=Unf{Src=vanilla, TopLvl=True, Arity=1, Value=True,
  ConLike=True, Cheap=True, Expandable=True,
  Guidance=IF_ARGS [12] 11 0}]
  CaseLen.$wlenTens =
   \ (ww_shY :: GHC.Prim.Int#) ->
     case ww_shY of _ {
       __DEFAULT ->
         CaseLen.lenTens1
         `cast` (CoUnsafe GHC.Types.Int GHC.Prim.Int#
                 :: GHC.Types.Int ~ GHC.Prim.Int#);
       0 -> 0;
       1 -> 3;
       2 -> 6;
       3 -> 6;
       4 -> 5;
       5 -> 5;
       6 -> 5;
       7 -> 7;
       8 -> 6;
       9 -> 6
     }
 
  means at a lower level, but it's certainly worth trying out whether an
  unboxed array lookup is faster.
 
 
 
 



 --
 Eugene Kirpichov
 Principal Engineer, Mirantis Inc. http://www.mirantis.com/
 Editor, http://fprog.ru/



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-08 Thread Daniel Fischer
On Monday 08 August 2011, 18:24:45, Chris Yuen wrote:
 Where is the `unsafeAt` function?

Data.Array.Base

 I can't seem to find it (
 http://haskell.org/hoogle/?hoogle=unsafeat).

Data.Array.Base is not haddocked (there's a reason for that), so hoogle 
doesn't know about its functions.
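
Usage is a drop-in replacement for (!), minus the bounds check. A minimal
sketch, using the lenTens values from earlier in this thread:

import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed (UArray, listArray)

-- Word lengths of "", "ten", "twenty", ..., "ninety".
lenTensArr :: UArray Int Int
lenTensArr = listArray (0, 9) [0, 3, 6, 6, 5, 5, 5, 7, 6, 6]

-- unsafeAt takes the zero-based offset and skips the bounds check
-- that (!) performs; the caller must keep the index in range.
lenTens' :: Int -> Int
lenTens' d = lenTensArr `unsafeAt` d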

 
 For reference I have asked the same question on StackOverflow. One
 person suggested that the reason might be that Int64 on Windows is
 broken (
 http://stackoverflow.com/questions/6970904/analyzing-slow-performance-o
 f-a-haskell-program/6976448#6976448 ).

As far as I know, there's no 64-bit GHC for Windows, so that might well
have a performance impact even on 64-bit Windows.



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-08 Thread Bryan O'Sullivan
On Mon, Aug 8, 2011 at 9:24 AM, Chris Yuen kizzx2+hask...@gmail.com wrote:


 For reference I have asked the same question on StackOverflow. One person
 suggested that the reason might be that Int64 on Windows is broken (
 http://stackoverflow.com/questions/6970904/analyzing-slow-performance-of-a-haskell-program/6976448#6976448
 ).


No, they're barking up the wrong tree.

I've put an idiomatic Haskell translation of your C++ algorithm at
https://gist.github.com/1133048#file_wordy.hs

(I've also included a copy of your original C++, with a bug fixed, in the
same gist.)

As you can see, the two are almost identical. Not surprisingly, each one
spends the bulk of its time computing word lengths.

GHC simply doesn't do a great job of compiling fairly tight code like this.
gcc generates about 100 lines of assembly that's mostly easy to follow
(except for some bit-twiddling tricks to avoid div instructions). Although
the Core it generates looks fine, GHC spends quite a bit of time in its
generated assembly on what looks to me like STG housekeeping (it spends only
0.3% of its time in the garbage collector, because it doesn't allocate
memory). The overall result is that the Haskell code runs about 5x more
slowly than the C++ code.


Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-08 Thread Reiner Pope
On 9 August 2011 10:06, Bryan O'Sullivan b...@serpentine.com wrote:

 On Mon, Aug 8, 2011 at 9:24 AM, Chris Yuen kizzx2+hask...@gmail.com wrote:


 For reference I have asked the same question on StackOverflow. One person
 suggested that the reason might be that Int64 on Windows is broken (
 http://stackoverflow.com/questions/6970904/analyzing-slow-performance-of-a-haskell-program/6976448#6976448
 ).


 No, they're barking up the wrong tree.

 I've put an idiomatic Haskell translation of your C++ algorithm at
 https://gist.github.com/1133048#file_wordy.hs

 (I've also included a copy of your original C++, with a bug fixed, in the
 same gist.)

 As you can see, the two are almost identical. Not surprisingly, each one
 spends the bulk of its time computing word lengths.

 GHC simply doesn't do a great job of compiling fairly tight code like this.
 gcc generates about 100 lines of assembly that's mostly easy to follow
 (except for some bit-twiddling tricks to avoid div instructions). Although
 the Core it generates looks fine, GHC spends quite a bit of time in its
 generated assembly on what looks to me like STG housekeeping (it spends only
 0.3% of its time in the garbage collector, because it doesn't allocate
 memory). The overall result is that the Haskell code runs about 5x more
 slowly than the C++ code.


GHC generating bad assembly suggests trying the llvm codegen (see
http://donsbot.wordpress.com/2010/02/21/smoking-fast-haskell-code-using-ghcs-new-llvm-codegen/).
Compiling Bryan's code with

$ ghc -O2 -fllvm Wordy.hs

it now runs only 2x slower than the C++ code.

Reiner


Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-07 Thread Max Bolingbroke
On 7 August 2011 06:15, Chris Yuen kizzx2+hask...@gmail.com wrote:
 I am mainly interested in making the Haskell version perform
 comparably to the C# version. Right now it is at least 5x slower so
 obviously I am missing something obvious)

You have a map call which is immediately consumed by solve. GHC's
fusion won't help you because solve is not defined in terms of
foldr. Fusing this manually (http://hpaste.org/49936) you can get 10%
improvement.
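
Fusing manually just means pushing the mapped function into the
consumer's own loop. A generic sketch (not the exact hpaste code):

{-# LANGUAGE BangPatterns #-}

-- Replaces `sum (map f xs)`: no intermediate list cells are
-- allocated, whether or not GHC's foldr/build fusion fires.
sumMap :: (Int -> Int) -> [Int] -> Int
sumMap f = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + f x) xs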

Another source of problems is the divMod calls. There are two issues:
  1. The result of the divMod is a pair of boxed Int64s. This can be
worked around by using div and mod separately instead, but that is
actually slower even though it avoids the boxing.
  2. The divMod is checked: i.e. it throws a Haskell exception if
the first argument is minBound or the second is 0. This means that
divMod does two equality checks and one unboxing operation (which will
just be an always-taken branch, thanks to pointer tagging) before it
actually reaches GHC.Base.divInt#

If I use divInt# and modInt# directly like so:

{{{
wordLength' :: Int64 -> Int64 -> Int64
wordLength' !pad !n@(I64# n#)
    | n < 10      = lenOnes n + pad
    | n < 20      = lenTeens (n-10) + pad
    | n < 100     = splitterTen
    | n < 1000    = splitter 100 7
    | n < 1000000 = splitter 1000 8
    | otherwise   = splitter 1000000 7
    where
    splitterTen = let -- !(!t, !x) = n `divMod` 10
                      t = n# `divInt#` 10#
                      x = n# `modInt#` 10#
                  in wordLength' (lenTens (I64# t) + pad) (I64# x)
    splitter !(I# d#) !suffix = let -- !(!t, !x) = n `divMod` d
                                    t = n# `divInt#` d#
                                    x = n# `modInt#` d#
                                in wordLength' (wordLength' (suffix+pad) (I64# t)) (I64# x)
}}}

We sacrifice these checks but the code gets 25% faster again.

I can't see anything else obviously wrong with the core, so the
remaining issues are likely to be things like loop unrolling, turning
div by a constant int divisor into a multiply and other code
generation issues. I tried -fllvm but it has no effect. At a guess,
this is because optimisations are impeded by the call to stg_gc_fun in
the stack check that solve makes.

In short I don't see how to get further without changing the algorithm
or doing some hacks like manual unrolling. Maybe someone else has some
ideas?

Max



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-07 Thread Daniel Fischer
On Sunday 07 August 2011, 10:52:20, Max Bolingbroke wrote:
 In short I don't see how to get further without changing the algorithm
 or doing some hacks like manual unrolling. Maybe someone else has some
 ideas?

Well, the C# implementation uses arrays for lookup while the Haskell 
version uses list lookups

  in (tens !! fromIntegral t) ++ wordify x

and case'd functions

lenTens 0 = 0
lenTens 1 = 3
lenTens 2 = 6
lenTens 3 = 6
lenTens 4 = 5
lenTens 5 = 5
lenTens 6 = 5
lenTens 7 = 7
lenTens 8 = 6
lenTens 9 = 6
 
wordify is only called once at the end, so that should not have a 
measurable impact, but the lenXXXs might.
I'm not sure what

CaseLen.$wlenTens :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId,
 Arity=1,
 Str=DmdType L,
 Unf=Unf{Src=vanilla, TopLvl=True, Arity=1, Value=True,
 ConLike=True, Cheap=True, Expandable=True,
 Guidance=IF_ARGS [12] 11 0}]
CaseLen.$wlenTens =
  \ (ww_shY :: GHC.Prim.Int#) ->
    case ww_shY of _ {
      __DEFAULT ->
        CaseLen.lenTens1
        `cast` (CoUnsafe GHC.Types.Int GHC.Prim.Int#
                :: GHC.Types.Int ~ GHC.Prim.Int#);
      0 -> 0;
      1 -> 3;
      2 -> 6;
      3 -> 6;
      4 -> 5;
      5 -> 5;
      6 -> 5;
      7 -> 7;
      8 -> 6;
      9 -> 6
    }

means at a lower level, but it's certainly worth trying out whether an 
unboxed array lookup is faster.
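
For instance, a sketch of the lenTens table above as an unboxed array:

import Data.Array.Unboxed (UArray, listArray, (!))

-- One bounds check plus an indexed read per lookup.
lenTensArr :: UArray Int Int
lenTensArr = listArray (0, 9) [0, 3, 6, 6, 5, 5, 5, 7, 6, 6]

-- usage:  lenTensArr ! d   instead of   lenTens d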



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-07 Thread Chris Yuen
Here is an updated version using Data.Array.Unboxed: http://ideone.com/YXuVL
And the profile http://hpaste.org/49940

Still taking 5+ minutes...

Chris

On Sun, Aug 7, 2011 at 5:20 PM, Daniel Fischer 
daniel.is.fisc...@googlemail.com wrote:

 On Sunday 07 August 2011, 10:52:20, Max Bolingbroke wrote:
  In short I don't see how to get further without changing the algorithm
  or doing some hacks like manual unrolling. Maybe someone else has some
  ideas?

 Well, the C# implementation uses arrays for lookup while the Haskell
 version uses list lookups

  in (tens !! fromIntegral t) ++ wordify x

 and case'd functions

 lenTens 0 = 0
 lenTens 1 = 3
 lenTens 2 = 6
 lenTens 3 = 6
 lenTens 4 = 5
 lenTens 5 = 5
 lenTens 6 = 5
 lenTens 7 = 7
 lenTens 8 = 6
 lenTens 9 = 6

 wordify is only called once at the end, so that should not have a
 measurable impact, but the lenXXXs might.
 I'm not sure what

 CaseLen.$wlenTens :: GHC.Prim.Int# -> GHC.Prim.Int#
 [GblId,
  Arity=1,
  Str=DmdType L,
  Unf=Unf{Src=vanilla, TopLvl=True, Arity=1, Value=True,
 ConLike=True, Cheap=True, Expandable=True,
 Guidance=IF_ARGS [12] 11 0}]
 CaseLen.$wlenTens =
  \ (ww_shY :: GHC.Prim.Int#) ->
    case ww_shY of _ {
      __DEFAULT ->
        CaseLen.lenTens1
        `cast` (CoUnsafe GHC.Types.Int GHC.Prim.Int#
                :: GHC.Types.Int ~ GHC.Prim.Int#);
      0 -> 0;
      1 -> 3;
      2 -> 6;
      3 -> 6;
      4 -> 5;
      5 -> 5;
      6 -> 5;
      7 -> 7;
      8 -> 6;
      9 -> 6
    }

 means at a lower level, but it's certainly worth trying out whether an
 unboxed array lookup is faster.



Re: [Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-07 Thread Eugene Kirpichov
What about using unsafe array indexing operations? (i.e. array `unsafeAt` index)

2011/8/7 Chris Yuen kizzx2+hask...@gmail.com:
 Here is an updated version using Data.Array.Unboxed  http://ideone.com/YXuVL
 And the profile http://hpaste.org/49940

 Still taking 5+ minutes...

 Chris

 On Sun, Aug 7, 2011 at 5:20 PM, Daniel Fischer
 daniel.is.fisc...@googlemail.com wrote:

 On Sunday 07 August 2011, 10:52:20, Max Bolingbroke wrote:
  In short I don't see how to get further without changing the algorithm
  or doing some hacks like manual unrolling. Maybe someone else has some
  ideas?

 Well, the C# implementation uses arrays for lookup while the Haskell
 version uses list lookups

                      in (tens !! fromIntegral t) ++ wordify x

 and case'd functions

 lenTens 0 = 0
 lenTens 1 = 3
 lenTens 2 = 6
 lenTens 3 = 6
 lenTens 4 = 5
 lenTens 5 = 5
 lenTens 6 = 5
 lenTens 7 = 7
 lenTens 8 = 6
 lenTens 9 = 6

 wordify is only called once at the end, so that should not have a
 measurable impact, but the lenXXXs might.
 I'm not sure what

 CaseLen.$wlenTens :: GHC.Prim.Int# -> GHC.Prim.Int#
 [GblId,
  Arity=1,
  Str=DmdType L,
  Unf=Unf{Src=vanilla, TopLvl=True, Arity=1, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=IF_ARGS [12] 11 0}]
 CaseLen.$wlenTens =
  \ (ww_shY :: GHC.Prim.Int#) ->
    case ww_shY of _ {
      __DEFAULT ->
        CaseLen.lenTens1
        `cast` (CoUnsafe GHC.Types.Int GHC.Prim.Int#
                :: GHC.Types.Int ~ GHC.Prim.Int#);
      0 -> 0;
      1 -> 3;
      2 -> 6;
      3 -> 6;
      4 -> 5;
      5 -> 5;
      6 -> 5;
      7 -> 7;
      8 -> 6;
      9 -> 6
    }

 means at a lower level, but it's certainly worth trying out whether an
 unboxed array lookup is faster.







-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/



[Haskell-cafe] Analyzing slow performance of a Haskell program

2011-08-06 Thread Chris Yuen
Hello Cafe,

I was trying to solve ITA Software's Word Numbers puzzle (
http://www.itasoftware.com/careers/puzzle_archive.html) using a brute force
approach.

Here's my version of a quick recap of the problem:

 A wordNumber is defined as

 wordNumber 1 = "one"
 wordNumber 2 = "onetwo"
 wordNumber 3 = "onetwothree"
 wordNumber 15 =
"onetwothreefourfivesixseveneightnineteneleventwelvethirteenfourteenfifteen"
 ...

 Problem: Find the 51-billionth letter of (wordNumber Infinity); assuming
that letter is found at 'wordNumber x', also find 'sum [1..x]'

From an imperative perspective, a naive algorithm would be to keep two
counters, adding the length of each wordNumber and breaking to return
the result.
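
Roughly, the loop I have in mind looks like this sketch (`wordLength`
here is only a placeholder so the sketch compiles, not the real letter
counter):

{-# LANGUAGE BangPatterns #-}

-- Two counters: total letters emitted so far, and the running sum of n.
solve :: Int -> (Int, Int)   -- returns (x, sum [1..x])
solve target = go 1 0 0
  where
    go !n !letters !total
      | letters' >= target = (n, total + n)
      | otherwise          = go (n + 1) letters' (total + n)
      where letters' = letters + wordLength n

-- Placeholder; the real function counts the letters of the English
-- words for n.
wordLength :: Int -> Int
wordLength = length . show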

The imperative brute-force approach is implemented in C# here:
http://ideone.com/JjCb3. It takes about 1.5 minutes to find the answer on my
computer.

Then I implemented a brute-force Haskell version: http://ideone.com/ngfFq.
It cannot finish the calculation in 5 minutes on my machine. (Irony: it
has more lines than the C# version)

Here is the `-p` profile of the Haskell program: http://hpaste.org/49934

The Question: Are there obvious mistakes I am making?

(Note: I am fully aware that brute-forcing it is not the correct solution to
this problem. I am mainly interested in making the Haskell version perform
comparably to the C# version. Right now it is at least 5x slower so
obviously I am missing something obvious)

(Note 2: It does not seem to be space leaking. The program runs with
constant memory (about 2MB) on my computer)

(Note 3: I am compiling with `ghc -O2 WordNumber.hs`)

Chris