Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-10 Thread Jake McArthur
Here's a transcript from a conversation I had with Conal on IRC.

tl;dr: conal cross-module inlining is only possible because ghc
stashes a definition in a .hi, iuuc.  i'm suggesting that the stashed
definition either (a) never include further inlinings, or (b) be
augmented by such a definition.

Full transcript:

11:23:10 conal jmcarthur: i'm wondering what to do about INLINE
pragmas for vector-space and other libraries.  i love optimizability
and clean/elegant/terse code.  and i don't know how to resolve that
tension.
11:23:21 jmcarthur conal: yeah, me either. it's annoying
11:23:41 jmcarthur conal: a compiler feature to do it more
succinctly would be nice, if we can come up with one
11:23:52 conal jmcarthur: i'm thinking exactly the same
11:24:04 conal jmcarthur: a ghc flag that does what you did manually
11:24:41 jmcarthur conal: yeah, but we still need to do better than
inlining *all* functions. we need to be able to tell it we want it to
inline all functions satisfying some predicate or something
11:25:07 jmcarthur like, there's no point in forcing to inline
functions having absolutely nothing to do with vector, for example
11:25:18 conal jmcarthur: i wonder.  ghc already has some
heuristics.  do we really want anything different/unusual?
11:25:26 jmcarthur then again, combinators that don't inline and get
used in a vector function later might still be annoying
11:26:08 conal jmcarthur: maybe some kind of demand-driven mechanism
11:26:21 jmcarthur conal: that's what i was thinking would be best
11:26:28 conal jmcarthur: ie pull inlining rather than push them.
or some combo.
11:27:21 conal jmcarthur: i don't think this issue is specific to
either vector fusion or to the vector-space package.
11:27:28 jmcarthur conal: actually, this is about rewrite rules more
than inlining
11:27:40 jmcarthur conal: maybe if we focus on the rewrite rules we
can think of something nicer
11:27:46 conal jmcarthur: ah, yeah.
11:28:32 conal jmcarthur: have you talked with they ghc guys about
this issue?  i wonder what practice they'd advise for use with the
current ghc
11:28:54 jmcarthur i have not
11:29:47 conal jmcarthur: how did the inlining/rewriting concern
arise for vector fusion and the vector-space package?
11:30:16 jmcarthur conal: i assume you read the email i linked to?
11:30:27 jmcarthur this one:
http://www.haskell.org/pipermail/haskell-cafe/2010-March/074153.html
11:30:34 conal jmcarthur: yes.  i'l reread now.
11:31:03 jmcarthur conal: in general, you have to add INLINE
pragmas in such cases if you want to be sure your code fuses. A
general-purpose mechanism for handling situations like this
automatically would be great but we haven't found a good one so far.
11:31:14 jmcarthur i think the most relevant line
11:31:49 conal jmcarthur: thx.  and the difficulty (with current
ghc) is specifically cross-module, right?
11:32:00 jmcarthur conal: that is my understanding
11:32:10 jmcarthur but perhaps it is more complex
11:32:49 conal jmcarthur: if so, i wonder if ghc could be fixed to
inline between modules according to the same heuristics as within a
module.
11:34:36 jmcarthur conal: maybe.
11:35:34 conal jmcarthur: part of my discomfort is that i don't know
whether the INLINE directives are more helpful or more harmful under
all uses.  if they were generally helpful, i imagine ghc would do it.
11:56:56 jmcarthur me too
11:57:46 conal jmcarthur: i just found that haskell-cafe thread and
added a reply.
11:58:31 conal jmcarthur: hoping that don, roman, etc will have some
ideas in addressing that discomfort.
12:09:58 jmcarthur conal: apparently the real trick is that GHC will
not inline functions in a function that is annotated INLINE, meaning
that rewrite rules can fire on the outermost rule before firing on
inner ones
12:10:30 jmcarthur conal: i think it would be nice if we could come
up with a way for rewrite rules to affect GHC's inliner
12:10:44 conal jmcarthur: yeah.  maybe INLINE ought to be decomposed
into two sub-meanings.
12:10:45 jmcarthur then it would only happen when necessary
12:11:17 jmcarthur well, the fact that it forces that function to be
inlined is also good though
12:11:33 jmcarthur which is apparently important across module boundaries
12:11:45 conal jmcarthur: maybe ghc could *never* inline functions
into an inline body.  and then do some caching to avoid redundant
work.
12:12:00 jmcarthur perhaps. still leaves the cross-module inlining
issue though
12:12:32 jmcarthur i suspect this is also architectural
12:12:44 jmcarthur ghc doesn't know if it will inline a function
across a module boundary in advance
12:12:52 jmcarthur therefore it goes ahead and inlines into it
12:13:04 jmcarthur *inlines other functions into it
12:13:31 conal jmcarthur: i don't understand how module boundaries
come into play
12:14:40 jmcarthur conal: my suspicion is that because ghc builds
modules separately it can't know whether a function will be inlined in
another module, so if it's not marked INLINE it feels free to 

Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-10 Thread Max Bolingbroke
This is my understanding:

Old story (GHC 6.12.1 (?) and below):
1) Function bodies are only optimised if they are not marked INLINE.
The assumption is that INLINE bodies will be inlined and then
optimised at the point where they are inlined.
2) Unfoldings are exported in .hi files for functions if they are
nonrecursive and small enough or marked INLINE

New story (GHC HEAD):
1) Function bodies are optimised regardless of whether they are marked
INLINE or not.
2) Unfoldings are exported in .hi files for functions if they are
nonrecursive and small enough or marked INLINE, but the unfolding
that is exported is the _unoptimised_ version!
  - It is important to export the unoptimised version because the
optimisation may destroy information we would have relied upon in rule
matching. E.g. the optimised version of a list producer will not use
build, but we need to be able to spot the build in rules to do
foldr/build fusion.

The new story is much more robust - adding INLINE pragmas should be a
much safer thing to do now, because even if the function is not
inlined you will still get an optimised definition for that function.

Other relevant points:
a) There is a new flag, -fexpose-all-unfoldings, in GHC HEAD. This
does what you want - compile vector-space with this flag on and
unfoldings for everything will show up in the .hi file, with no INLINE
annotations necessary. This was intended for supercompilers (such as
the one I'm working on :-) but you can probably use for this as well.
b) GHC already treats RULES differently for the purposes of inlining.
It is very eager to inline into the arguments of a function you have
RULES for, because doing so may let it spot extra chances for the rule
to fire.

Hope that helps,
Max

On 10 March 2010 16:23, Jake McArthur jake.mcart...@gmail.com wrote:
 Here's a transcript from a conversation I had with Conal on IRC.

 tl;dr: conal cross-module inlining is only possible because ghc
 stashes a definition in a .hi, iuuc.  i'm suggesting that the stashed
 definition either (a) never include further inlinings, or (b) be
 augmented by such a definition.

 Full transcript:

 11:23:10 conal jmcarthur: i'm wondering what to do about INLINE
 pragmas for vector-space and other libraries.  i love optimizability
 and clean/elegant/terse code.  and i don't know how to resolve that
 tension.
 11:23:21 jmcarthur conal: yeah, me either. it's annoying
 11:23:41 jmcarthur conal: a compiler feature to do it more
 succinctly would be nice, if we can come up with one
 11:23:52 conal jmcarthur: i'm thinking exactly the same
 11:24:04 conal jmcarthur: a ghc flag that does what you did manually
 11:24:41 jmcarthur conal: yeah, but we still need to do better than
 inlining *all* functions. we need to be able to tell it we want it to
 inline all functions satisfying some predicate or something
 11:25:07 jmcarthur like, there's no point in forcing to inline
 functions having absolutely nothing to do with vector, for example
 11:25:18 conal jmcarthur: i wonder.  ghc already has some
 heuristics.  do we really want anything different/unusual?
 11:25:26 jmcarthur then again, combinators that don't inline and get
 used in a vector function later might still be annoying
 11:26:08 conal jmcarthur: maybe some kind of demand-driven mechanism
 11:26:21 jmcarthur conal: that's what i was thinking would be best
 11:26:28 conal jmcarthur: ie pull inlining rather than push them.
 or some combo.
 11:27:21 conal jmcarthur: i don't think this issue is specific to
 either vector fusion or to the vector-space package.
 11:27:28 jmcarthur conal: actually, this is about rewrite rules more
 than inlining
 11:27:40 jmcarthur conal: maybe if we focus on the rewrite rules we
 can think of something nicer
 11:27:46 conal jmcarthur: ah, yeah.
 11:28:32 conal jmcarthur: have you talked with they ghc guys about
 this issue?  i wonder what practice they'd advise for use with the
 current ghc
 11:28:54 jmcarthur i have not
 11:29:47 conal jmcarthur: how did the inlining/rewriting concern
 arise for vector fusion and the vector-space package?
 11:30:16 jmcarthur conal: i assume you read the email i linked to?
 11:30:27 jmcarthur this one:
 http://www.haskell.org/pipermail/haskell-cafe/2010-March/074153.html
 11:30:34 conal jmcarthur: yes.  i'l reread now.
 11:31:03 jmcarthur conal: in general, you have to add INLINE
 pragmas in such cases if you want to be sure your code fuses. A
 general-purpose mechanism for handling situations like this
 automatically would be great but we haven't found a good one so far.
 11:31:14 jmcarthur i think the most relevant line
 11:31:49 conal jmcarthur: thx.  and the difficulty (with current
 ghc) is specifically cross-module, right?
 11:32:00 jmcarthur conal: that is my understanding
 11:32:10 jmcarthur but perhaps it is more complex
 11:32:49 conal jmcarthur: if so, i wonder if ghc could be fixed to
 inline between modules according to the same heuristics as within a
 module.
 11:34:36 jmcarthur 

Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-10 Thread Conal Elliott
Hi Max.  Thanks much for passing on this info.  Very encouraging news!  -
Conal

On Wed, Mar 10, 2010 at 8:41 AM, Max Bolingbroke batterseapo...@hotmail.com
 wrote:

 This is my understanding:

 Old story (GHC 6.12.1 (?) and below):
 1) Function bodies are only optimised if they are not marked INLINE.
 The assumption is that INLINE bodies will be inlined and then
 optimised at the point where they are inlined.
 2) Unfoldings are exported in .hi files for functions if they are
 nonrecursive and small enough or marked INLINE

 New story (GHC HEAD):
 1) Function bodies are optimised regardless of whether they are marked
 INLINE or not.
 2) Unfoldings are exported in .hi files for functions if they are
 nonrecursive and small enough or marked INLINE, but the unfolding
 that is exported is the _unoptimised_ version!
  - It is important to export the unoptimised version because the
 optimisation may destroy information we would have relied upon in rule
 matching. E.g. the optimised version of a list producer will not use
 build, but we need to be able to spot the build in rules to do
 foldr/build fusion.

 The new story is much more robust - adding INLINE pragmas should be a
 much safer thing to do now, because even if the function is not
 inlined you will still get an optimised definition for that function.

 Other relevant points:
 a) There is a new flag, -fexpose-all-unfoldings, in GHC HEAD. This
 does what you want - compile vector-space with this flag on and
 unfoldings for everything will show up in the .hi file, with no INLINE
 annotations necessary. This was intended for supercompilers (such as
 the one I'm working on :-) but you can probably use for this as well.
 b) GHC already treats RULES differently for the purposes of inlining.
 It is very eager to inline into the arguments of a function you have
 RULES for, because doing so may let it spot extra chances for the rule
 to fire.

 Hope that helps,
 Max

 On 10 March 2010 16:23, Jake McArthur jake.mcart...@gmail.com wrote:
  Here's a transcript from a conversation I had with Conal on IRC.
 
  tl;dr: conal cross-module inlining is only possible because ghc
  stashes a definition in a .hi, iuuc.  i'm suggesting that the stashed
  definition either (a) never include further inlinings, or (b) be
  augmented by such a definition.
 
  Full transcript:
 
  11:23:10 conal jmcarthur: i'm wondering what to do about INLINE
  pragmas for vector-space and other libraries.  i love optimizability
  and clean/elegant/terse code.  and i don't know how to resolve that
  tension.
  11:23:21 jmcarthur conal: yeah, me either. it's annoying
  11:23:41 jmcarthur conal: a compiler feature to do it more
  succinctly would be nice, if we can come up with one
  11:23:52 conal jmcarthur: i'm thinking exactly the same
  11:24:04 conal jmcarthur: a ghc flag that does what you did manually
  11:24:41 jmcarthur conal: yeah, but we still need to do better than
  inlining *all* functions. we need to be able to tell it we want it to
  inline all functions satisfying some predicate or something
  11:25:07 jmcarthur like, there's no point in forcing to inline
  functions having absolutely nothing to do with vector, for example
  11:25:18 conal jmcarthur: i wonder.  ghc already has some
  heuristics.  do we really want anything different/unusual?
  11:25:26 jmcarthur then again, combinators that don't inline and get
  used in a vector function later might still be annoying
  11:26:08 conal jmcarthur: maybe some kind of demand-driven mechanism
  11:26:21 jmcarthur conal: that's what i was thinking would be best
  11:26:28 conal jmcarthur: ie pull inlining rather than push them.
  or some combo.
  11:27:21 conal jmcarthur: i don't think this issue is specific to
  either vector fusion or to the vector-space package.
  11:27:28 jmcarthur conal: actually, this is about rewrite rules more
  than inlining
  11:27:40 jmcarthur conal: maybe if we focus on the rewrite rules we
  can think of something nicer
  11:27:46 conal jmcarthur: ah, yeah.
  11:28:32 conal jmcarthur: have you talked with they ghc guys about
  this issue?  i wonder what practice they'd advise for use with the
  current ghc
  11:28:54 jmcarthur i have not
  11:29:47 conal jmcarthur: how did the inlining/rewriting concern
  arise for vector fusion and the vector-space package?
  11:30:16 jmcarthur conal: i assume you read the email i linked to?
  11:30:27 jmcarthur this one:
  http://www.haskell.org/pipermail/haskell-cafe/2010-March/074153.html
  11:30:34 conal jmcarthur: yes.  i'l reread now.
  11:31:03 jmcarthur conal: in general, you have to add INLINE
  pragmas in such cases if you want to be sure your code fuses. A
  general-purpose mechanism for handling situations like this
  automatically would be great but we haven't found a good one so far.
  11:31:14 jmcarthur i think the most relevant line
  11:31:49 conal jmcarthur: thx.  and the difficulty (with current
  ghc) is specifically cross-module, right?
  11:32:00 

Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-09 Thread Conal Elliott
I think Jake is referring to my vector-space package.  He did the work of
writing 171 INLINE pragmas, covering lots of methods and standalone function
defs.  I'm simultaneously grateful for the effort and repelled by the added
syntactic noise.  Also concerned about the impact of all these directives on
other uses of vector-space.  If all this inlining is a uniform win, I'd
rather ghc did it for me.  If the ghc implementers have reasons not to do so
in general, then I'd expect that some of those reasons apply to vector-space
(and many other libraries) as well.  It's not like I'm applying some kind of
domain-specific understanding that ghc doesn't have access to.

I'm raising my concerns here in the hopes of stimulating some creative
thinking and suggestions about addressing this sort of situation more
generally.

  - Conal

On Sun, Mar 7, 2010 at 9:23 PM, Don Stewart d...@galois.com wrote:

 jake.mcarthur:
  I've run into an issue with inlining that I'm not sure how to work
  around. I am instantiating some pre-existing type classes with
  Vector-based types. There already exist generic functions in modules I
  do not control that use this type class, and they are not tagged with
  the INLINE pragma. I am doubtful, but I figure it is worth at least
  asking: is there some practical workaround for this kind of situation
  that anybody knows about?
 

 I don't know of a way, other than patching the library code.
 If it makes a difference to you, it probably makes a difference to lots
 of people.

 -- Don
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-09 Thread Bryan O'Sullivan
On Tue, Mar 9, 2010 at 11:53 AM, Conal Elliott co...@conal.net wrote:

 I think Jake is referring to my vector-space package.  He did the work of
 writing 171 INLINE pragmas, covering lots of methods and standalone function
 defs.  I'm simultaneously grateful for the effort and repelled by the added
 syntactic noise.  Also concerned about the impact of all these directives on
 other uses of vector-space.  If all this inlining is a uniform win, I'd
 rather ghc did it for me.


Alas, it very much is not easy to predict. The unfortunate thing about
inline directives is that each individual one really can have a substantial,
but not necessarily predictable, effect on the performance of an
application. I have seen large improvements in performance, large drops in
performance, nothing at all, and everything in between, and I have yet to
develop a consistently successful intuition about what will work well, and
when.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-07 Thread Jan-Willem Maessen

On Mar 4, 2010, at 9:05 PM, Roman Leshchinskiy wrote:

 On 05/03/2010, at 04:34, stefan kersten wrote:
 
 i've been hunting down some performance problems in DSP code using vector and
 the single most important transformation seems to be throwing in INLINE 
 pragmas
 for any function that uses vector combinators and is to be called from
 higher-level code. failing to do so seems to prevent vector operations from
 being fused and results in big performance hits (the good news is that the
 optimized code is quite competitive)
 
 the downside after adding the INLINE pragmas is that now some of my modules 
 take
 _really_ long to compile (up to a couple of minutes); any ideas where i can
 start looking to bring the compilation times down again?
 
 Alas, stream fusion (and fusion in general, I guess) requires what I would 
 call whole loop compilation - you need to inline everything into loops. That 
 tends to be slow. I don't know what your code looks like but you could try to 
 control inlining a bit more. For instance, if you have something like this:
 
 foo ... = ... map f xs ...
  where
f x = ...
 

I can confirm that this is a general problem with libraries based on fusion / 
deforestation (having done the independent implementation of fusion in pH back 
in the day).  No INLINE pragma?  No fusion for you!

That said, as Roman points out it'd be nice if when GHC discovers something is 
inlinable, it would inline the original definition (or perhaps the inlined, 
simplified, no-rules-firing version of same).  The problem is that this 
duplicates a lot of the work of the optimizer a lot of the time.

 you could tell GHC not to inline f until fairly late in the game by adding
 
  {-# INLINE [0] f #-}
 
 to the where clause. This helps sometimes.

Hands up if you can remember what things are legal in the braces, and what they 
mean. :-)  I suspect I'm not the only one for whom remembering this stuff is 
fairly hard, in part because it doesn't come up too often.

-Jan-Willem Maessen___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-07 Thread Roman Leshchinskiy
On 06/03/2010, at 03:10, stefan kersten wrote:

 i'm still curious, though, why my three versions of direct convolution perform
 so differently (see attached file). in particular, i somehow expected conv_3 
 to
 be the slowest and conv_2 to perform similar to conv_1. any ideas? i haven't 
 had
 a look at the core yet, mainly because i'm lacking the expertise ...

Hmm, one problem is that the current definition of reverse is suboptimal to say 
the least. I'll fix that.

Could you perhaps send me your complete benchmark?

Roman


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-07 Thread Jake McArthur
I've run into an issue with inlining that I'm not sure how to work 
around. I am instantiating some pre-existing type classes with 
Vector-based types. There already exist generic functions in modules I 
do not control that use this type class, and they are not tagged with 
the INLINE pragma. I am doubtful, but I figure it is worth at least 
asking: is there some practical workaround for this kind of situation 
that anybody knows about?


- Jake
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-07 Thread Don Stewart
jake.mcarthur:
 I've run into an issue with inlining that I'm not sure how to work  
 around. I am instantiating some pre-existing type classes with  
 Vector-based types. There already exist generic functions in modules I  
 do not control that use this type class, and they are not tagged with  
 the INLINE pragma. I am doubtful, but I figure it is worth at least  
 asking: is there some practical workaround for this kind of situation  
 that anybody knows about?


I don't know of a way, other than patching the library code.
If it makes a difference to you, it probably makes a difference to lots
of people.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-05 Thread stefan kersten
 This is a general problem when working with RULES-based
 optimisations. Here is an example of what happens: suppose we have
 
 foo :: Vector Int - Vector Int
 foo xs = map (+1) xs
 
 Now, GHC will generate a nice tight loop for this but if in a
 different module, we have something like this:
 
 bar xs = foo (foo xs)
 
 then this won't fuse because (a) foo won't be inlined and (b) even if
 GHC did inline here, it would inline the nice tight loop which can't
 possibly fuse instead of the original map which can. By slapping an
 INLINE pragma on foo, you're telling GHC to (almost) always inline the
 function and to use the original definition for inlining, thus giving
 it a chance to fuse.

thanks for the insight, roman!

 the downside after adding the INLINE pragmas is that now some of my modules 
 take
 _really_ long to compile (up to a couple of minutes); any ideas where i can
 start looking to bring the compilation times down again?
 
 Alas, stream fusion (and fusion in general, I guess) requires what I
 would call whole loop compilation - you need to inline everything into
 loops. That tends to be slow. I don't know what your code looks like
 but you could try to control inlining a bit more. For instance, if you
 have something like this:
 
 foo ... = ... map f xs ...
   where
 f x = ...
 
 you could tell GHC not to inline f until fairly late in the game by adding
 
   {-# INLINE [0] f #-}
 
 to the where clause. This helps sometimes.

thanks, i'll check it out.

 I'm surprised -Odph doesn't produce faster code than -O2. In any
 case, you could try turning these flags on individually (esp.
 -fno-method-sharing and the spec-constr flags) to see how they affect
 performance and compilation times.

in the end it turned out that i had forgotten another INLINE pragma and in my
crude benchmarks -O2 and -Odph give basically the same results, -O2 being a
little faster. i hope i'll have time next week to do proper benchmarks, and i
also want to try ghc HEAD with the llvm patches.

conv_1  conv_2  conv_3
-Odph   1.004   2.715   1.096
-O2 1.000   2.710   1.097

i'm still curious, though, why my three versions of direct convolution perform
so differently (see attached file). in particular, i somehow expected conv_3 to
be the slowest and conv_2 to perform similar to conv_1. any ideas? i haven't had
a look at the core yet, mainly because i'm lacking the expertise ...

sk
import   Data.Vector.Generic (Vector, (!))
import qualified Data.Vector.Generic as V

conv_1, conv_2, conv_3 :: (Num a, Vector v a) = v a - v a - v a
{-# INLINE conv_1 #-}
conv_1 h x = V.generate (l+m) f
where
m = V.length h - 1
l = V.length x
{-# INLINE f #-}
f n = g 0 n (max 0 (n-l+1)) (min n m)
g y n m k = if m = k
then let y' = y + (h ! m) * (x ! (n-m))
 in y' `seq` g y' n (m+1) k
else y
{-# INLINE conv_2 #-}
conv_2 h x = V.generate (l+m) f
where
l = V.length x
m = V.length h - 1
{-# INLINE f #-}
f n = let j = max 0 (n-l+1)
  k = (min n m) - j + 1
  in V.sum (V.zipWith (*) (V.slice j k h) (V.reverse (V.slice (n - 
j - k + 1) k x)))
{-# INLINE conv_3 #-}
conv_3 h x = V.generate (l+m-1) f
where
m   = V.length h
l   = V.length x
p   = V.replicate (m-1) 0
x'  = p ++ x ++ p
{-# INLINE f #-}
f i = V.sum (V.zipWith (*) (V.reverse h) (V.slice i m x'))
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-04 Thread Don Stewart
sk:
 hi,
 
 two questions in one post:
 
 i've been hunting down some performance problems in DSP code using vector and
 the single most important transformation seems to be throwing in INLINE 
 pragmas
 for any function that uses vector combinators and is to be called from
 higher-level code. failing to do so seems to prevent vector operations from
 being fused and results in big performance hits (the good news is that the
 optimized code is quite competitive). does anybody have some more info about 
 the
 do's and don'ts when programming with vector?

Always inline any combination of things that are expressed in terms of
vector combinators, so that the combination of your code can hope to
fuse as well.

 the downside after adding the INLINE pragmas is that now some of my modules 
 take
 _really_ long to compile (up to a couple of minutes); any ideas where i can
 start looking to bring the compilation times down again?

I'm not sure there's much we can do there.

 i'm compiling with -O2 -funbox-strict-fields instead of -Odph (with ghc 6.10.4
 on OSX 10.4), because it's faster for some of my code, but -O2 vs. -Odph 
 doesn't
 make a noticable difference in compilation time.

-Odph should make it easier for some things to fuse -- and get better
code. But Roman can say more.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] vector stream fusion, inlining and compilation time

2010-03-04 Thread Roman Leshchinskiy
On 05/03/2010, at 04:34, stefan kersten wrote:

 i've been hunting down some performance problems in DSP code using vector and
 the single most important transformation seems to be throwing in INLINE 
 pragmas
 for any function that uses vector combinators and is to be called from
 higher-level code. failing to do so seems to prevent vector operations from
 being fused and results in big performance hits (the good news is that the
 optimized code is quite competitive). does anybody have some more info about 
 the
 do's and don'ts when programming with vector?

This is a general problem when working with RULES-based optimisations. Here is 
an example of what happens: suppose we have

foo :: Vector Int - Vector Int
foo xs = map (+1) xs

Now, GHC will generate a nice tight loop for this but if in a different module, 
we have something like this:

bar xs = foo (foo xs)

then this won't fuse because (a) foo won't be inlined and (b) even if GHC did 
inline here, it would inline the nice tight loop which can't possibly fuse 
instead of the original map which can. By slapping an INLINE pragma on foo, 
you're telling GHC to (almost) always inline the function and to use the 
original definition for inlining, thus giving it a chance to fuse.

GHC could be a bit cleverer here (perhaps by noticing that the original 
definition is small enough to inline and keeping it) but in general, you have 
to add INLINE pragmas in such cases if you want to be sure your code fuses. A 
general-purpose mechanism for handling situations like this automatically would 
be great but we haven't found a good one so far.

 the downside after adding the INLINE pragmas is that now some of my modules 
 take
 _really_ long to compile (up to a couple of minutes); any ideas where i can
 start looking to bring the compilation times down again?

Alas, stream fusion (and fusion in general, I guess) requires what I would call 
whole loop compilation - you need to inline everything into loops. That tends 
to be slow. I don't know what your code looks like but you could try to control 
inlining a bit more. For instance, if you have something like this:

foo ... = ... map f xs ...
  where
f x = ...

you could tell GHC not to inline f until fairly late in the game by adding

  {-# INLINE [0] f #-}

to the where clause. This helps sometimes.

 i'm compiling with -O2 -funbox-strict-fields instead of -Odph (with ghc 6.10.4
 on OSX 10.4), because it's faster for some of my code, but -O2 vs. -Odph 
 doesn't
 make a noticable difference in compilation time.

If you're *really* interested in performance, I would suggest using GHC head. 
It really is much better for this kind of code (although not necessarily faster 
wrt to compilation times).

This is what -Odph does:

-- -Odph is equivalent to
--
---O2   optimise as much as possible
---fno-method-sharing   sharing specialisation defeats fusion
--  sometimes
---fdicts-cheap always inline dictionaries
---fmax-simplifier-iterations20 this is necessary sometimes
---fsimplifier-phases=3 we use an additional simplifier phase
--  for fusion
---fno-spec-constr-thresholdrun SpecConstr even for big loops
---fno-spec-constr-countSpecConstr as much as possible

I'm surprised -Odph doesn't produce faster code than -O2. In any case, you 
could try turning these flags on individually (esp. -fno-method-sharing and the 
spec-constr flags) to see how they affect performance and compilation times.

Roman


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe