Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-26 Thread GHC
#7429: Unexplained performance boost with +RTS -h
--+-
  Reporter:  simonmar |  Owner:  
  Type:  bug  | Status:  closed  
  Priority:  normal   |  Milestone:  7.8.1   
 Component:  Compiler |Version:  7.6.1   
Resolution:  wontfix  |   Keywords:  
Os:  Unknown/Multiple |   Architecture:  Unknown/Multiple
   Failure:  Runtime performance bug  | Difficulty:  Unknown 
  Testcase:   |  Blockedby:  
  Blocking:   |Related:  
--+-

Comment(by tomberek):

 Ok, then I'd like to know what I can do to create the data structure
 strictly. I've got the bang patterns on the CircList,
 {{{
 data CircList s = ConsM {value :: {-# UNPACK #-} !Int
 ,cons  :: {-# UNPACK #-} !(STRef s (CircList s ))
 }
 }}}

 Is there something in the following that should be strictified?
 {{{
 twoItemList :: forall s. ST s (CircList s)
 twoItemList = do
 rec item1 - liftM (ConsM 0) $ newSTRef item2
 item2 - liftM (ConsM 1) $ newSTRef item1
 return item1
 }}}
 or in

 {{{
 -- | Defines a circular list of length n
 nlist :: forall s. Int - ST s (CircList s)
 nlist ns = do
 rec allItems -
 let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns)
 in V.generateM ns next
 return $ (V.!) allItems 0
 where nextCons v n = do
 n' - newSTRef n
 return $ ConsM v n'
 }}}

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:10
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-26 Thread GHC
#7429: Unexplained performance boost with +RTS -h
--+-
  Reporter:  simonmar |  Owner:  
  Type:  bug  | Status:  closed  
  Priority:  normal   |  Milestone:  7.8.1   
 Component:  Compiler |Version:  7.6.1   
Resolution:  wontfix  |   Keywords:  
Os:  Unknown/Multiple |   Architecture:  Unknown/Multiple
   Failure:  Runtime performance bug  | Difficulty:  Unknown 
  Testcase:   |  Blockedby:  
  Blocking:   |Related:  
--+-

Comment(by simonmar):

 The laziness here is in the contents of the `STRef`, where you can't put a
 bang pattern.  Now the problem is that because you're creating a circular
 list, there must be at least ''one'' lazy element in it, because you're
 using laziness to create the circularity.  So you could probably add one
 bang pattern in `twoItemList`, but not two (I haven't tried this).

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:11
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-21 Thread GHC
#7429: Unexplained performance boost with +RTS -h
--+-
  Reporter:  simonmar |  Owner:  
  Type:  bug  | Status:  closed  
  Priority:  normal   |  Milestone:  7.8.1   
 Component:  Compiler |Version:  7.6.1   
Resolution:  wontfix  |   Keywords:  
Os:  Unknown/Multiple |   Architecture:  Unknown/Multiple
   Failure:  Runtime performance bug  | Difficulty:  Unknown 
  Testcase:   |  Blockedby:  
  Blocking:   |Related:  
--+-
Changes (by simonmar):

  * status:  new = closed
  * resolution:  = wontfix


Comment:

 Aha - of course I was forgetting something, the GC will also eliminate the
 indirections, and that seems to explain the difference at least in
 `Main2.hs`.

 In the original version, the data structure with the indirections is
 enough to push it over the size of the L1 cache, so all the misses are
 capacity misses (on my machine anyway).  In `Main2.hs`, the difference is
 all due to having to traverse the extra indirections in the data
 structure.

 Mystery solved - and I don't think there's anything we can do here.  If it
 is important to make this case go fast, you might try to rewrite the
 program so that it creates the data structure strictly, which will
 eliminate the indirections at creation time.

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:9
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-20 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--

Comment(by simonpj):

 Does `-fno-state-hack` make any difference?

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:4
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-20 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--

Comment(by tomberek):

 Replying to [comment:4 simonpj]:
  Does `-fno-state-hack` make any difference?

 No effect.

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:5
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-20 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--

Comment(by simonmar):

 I suspect it's a weird caching effect.  The program creates a small data
 structure and then repeatedly traverses it (if I'm understanding
 correctly). After a GC the program performs much better, which suggests
 that the GC has rearranged the data in memory in a more cache-friendly way
 (not deliberately, I should point out).  Or perhaps the data rearrangement
 has avoided some cache conflicts, but that seems unlikely - L1 caches are
 usually very associative.

 You should be able to get the same effect by adding a performGC after
 allocating the data structure but before traversing it.

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:6
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-20 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--
Changes (by tomberek):

 * cc: tomberek@… (added)


Comment:

 I'm not sure how to insert an IO () into the ST/STO monad, either that
 sounds unsafe or I lift everything to IO? I tried using:

 {{{
 lift $ unsafeIOToST performGC
 }}}

 in Main2.hs

 This fixes the problem when the circular list is of size 2 (using function
 twoItemList).  The problem creeps back in when using threeItemList or when
 the list is produced by nlist. Perhaps my generation of the circular list
 is suspect. Perhaps too lazy? The inner loop in Core is unchanged from
 previous versions.

 Regarding small data structure: This effect is actually more pronounced
 the larger the circular list is.  I've tested this up to n = 2 million and
 it seems that the GC overhead of profiling catches up around 1 million,
 but the non-GC time difference becomes very pronounced.

 So the GC fix works for an explicitly constructed circular list of
 length 2, but nothing else. And still, working in only ST is always slower
 than in a superfluous StateT over ST.

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:7
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-20 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--

Comment(by tomberek):

 Replying to [comment:7 tomberek]:
 correction, the GC fix doesn't work for twoItemList either (it's late, i
 got confused)

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:8
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-19 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--
Changes (by PHO):

 * cc: pho@… (added)


-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:1
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs


Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-19 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--

Comment(by tomberek):

 Replying to [ticket:7429 simonmar]:


 {{{
 {-# LANGUAGE
 RankNTypes
 ,RecursiveDo
 ,TypeFamilies
 ,GeneralizedNewtypeDeriving
   #-}

 import System.TimeIt
 import Control.Monad.State.Strict
 import Control.Monad.ST.Strict
 import Data.STRef

 -- STO Monad is just StateT with an Int stacked on top of a ST monad.
 newtype STO s a = STO { unSTO :: StateT Int (ST s) a }
 deriving (Functor, Monad)

 runSTO :: (forall s. STO s a) - a
 runSTO x = runST (evalStateT (unSTO x) 0)

 data CircList s = ConsM {value :: {-# UNPACK #-} !Int
 ,cons  :: {-# UNPACK #-} !(STRef s (CircList s ))
 }

 -- | Defines a circular list of length 2
 twoItemList :: forall s. ST s (CircList s)
 twoItemList = do
 rec item1 - liftM (ConsM 0) $ newSTRef item2
 item2 - liftM (ConsM 1) $ newSTRef item1
 return item1

 -- | I tie a circular list of size 2 and step through it n times.
 main :: IO ()
 main = do
 let n = 633222111 :: Int
 let takeOneStep = switchSTorSTO . readSTRef . cons

 runLine $ switchSTorSTO twoItemList  = liftM value . iterateM n
 takeOneStep

 --switchSTorSTO = id
 --runLine = timeIt . print . runST
 --    TO SWITCH TO ST: switch between the two
 sets of definitions above and below this line
 switchSTorSTO  = STO . lift
 runLine = timeIt . print . runSTO

 -- I find myself writing this function a lot, did I miss some basic Monad
 helper?
 iterateM :: (Monad m) = Int - (b - m b) - b - m b
 iterateM n f c = go n c where
 go 0 b = return b
 go ns b = f b = go (ns-1)

 }}}

 Here is a simpler version (also attached as Main.hs) that uses a circular
 list of length two. The odd thing is that this runs around 3 seconds when
 I only use the ST monad, or the STO without profiling. But the STO with
 profiling makes it FASTER to 2.6 seconds. The two versions seem to have
 the same GC stats as before. So then why not just use profiling ALL the
 time? I don't know if this difference exists for large 'project sized'
 uses of ST, but it seems a monad transformer on top of it along with
 profiling helps it go faster.

 The STO version produces this core, which looks pretty good to me:

 {{{
 Rec {
 Main.$wa [Occ=LoopBreaker]
   :: forall s_XF2.
  GHC.Prim.Int#
  - Main.CircList s_XF2
  - GHC.Types.Int
  - GHC.Prim.State# s_XF2
  - (# GHC.Prim.State# s_XF2,
(Main.CircList s_XF2, GHC.Types.Int) #)
 [GblId, Arity=4, Caf=NoCafRefs, Str=DmdType ]
 Main.$wa =
   \ (@ s_XF2)
 (ww_s1na :: GHC.Prim.Int#)
 (w_s1nc :: Main.CircList s_XF2)
 (w1_s1nd :: GHC.Types.Int)
 (w2_s1ne :: GHC.Prim.State# s_XF2) -
 case ww_s1na of ds_XFh {
   __DEFAULT -
 case w_s1nc of _ { Main.ConsM rb_dFC rb1_dFD -
 case GHC.Prim.readMutVar#
@ s_XF2 @ (Main.CircList s_XF2) rb1_dFD w2_s1ne
 of _ { (# ipv_aJ9, ipv1_aJa #) -
 Main.$wa @ s_XF2 (GHC.Prim.-# ds_XFh 1) ipv1_aJa w1_s1nd ipv_aJ9
 }
 };
   0 - (# w2_s1ne, (w_s1nc, w1_s1nd) #)
 }
 end Rec }
 }}}

 The effect disappears when you switch to using just ST, but the core looks
 almost identical other than the threading of the original Int in the
 StateT. Here is the core produced for the ST version:

 {{{
 Rec {
 Main.$wpoly_a [Occ=LoopBreaker]
   :: forall s_aDw.
  GHC.Prim.Int#
  - Main.CircList s_aDw
  - GHC.Prim.State# s_aDw
  - (# GHC.Prim.State# s_aDw, Main.CircList s_aDw #)
 [GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
 Main.$wpoly_a =
   \ (@ s_aDw)
 (ww_s1la :: GHC.Prim.Int#)
 (w_s1lc :: Main.CircList s_aDw)
 (w1_s1ld :: GHC.Prim.State# s_aDw) -
 case ww_s1la of ds_XFl {
   __DEFAULT -
 case w_s1lc of _ { Main.ConsM rb_dFI rb1_dFJ -
 case GHC.Prim.readMutVar#
@ s_aDw @ (Main.CircList s_aDw) rb1_dFJ w1_s1ld
 of _ { (# ipv_aIP, ipv1_aIQ #) -
 Main.$wpoly_a @ s_aDw (GHC.Prim.-# ds_XFl 1) ipv1_aIQ ipv_aIP
 }
 };
   0 - (# w1_s1ld, w_s1lc #)
 }
 end Rec }
 }}}

-- 
Ticket URL: 

Re: [GHC] #7429: Unexplained performance boost with +RTS -h

2012-11-19 Thread GHC
#7429: Unexplained performance boost with +RTS -h
-+--
Reporter:  simonmar  |   Owner: 
Type:  bug   |  Status:  new
Priority:  normal|   Milestone:  7.8.1  
   Component:  Compiler  | Version:  7.6.1  
Keywords:|  Os:  Unknown/Multiple   
Architecture:  Unknown/Multiple  | Failure:  Runtime performance bug
  Difficulty:  Unknown   |Testcase: 
   Blockedby:|Blocking: 
 Related:|  
-+--

Comment(by tomberek):

 Some code golf at #haskell produced a new version of iterateM:

 {{{
 iterateM2 :: Monad m = Int - (b - m b) - b - m b
 iterateM2 = (fmap.fmap) (foldr (=) return) replicate
 }}}

 This produces slightly different CORE as follows:
 Rec {
 Main.main6 [Occ=LoopBreaker]
   :: forall s_atH.
  GHC.Prim.Int#
  - Main.CircList s_atH
  - GHC.Types.Int
  - GHC.Prim.State# s_atH
  - (# GHC.Prim.State# s_atH,
(Main.CircList s_atH, GHC.Types.Int) #)
 [GblId, Arity=4, Caf=NoCafRefs, Str=DmdType LSLL]
 Main.main6 =
   \ (@ s_atH)
 (m_a16D :: GHC.Prim.Int#)
 (eta_B3 :: Main.CircList s_atH)
 (eta1_X14 :: GHC.Types.Int)
 (eta2_X15 :: GHC.Prim.State# s_atH) -
 case GHC.Prim.=# m_a16D 1 of _ {
   GHC.Types.False -
 case eta_B3 of _ { Main.ConsM rb_dE3 rb1_dE4 -
 case GHC.Prim.readMutVar#
@ s_atH @ (Main.CircList s_atH) rb1_dE4 eta2_X15
 of _ { (# ipv_aEY, ipv1_aEZ #) -
 Main.main6 @ s_atH (GHC.Prim.-# m_a16D 1) ipv1_aEZ eta1_X14
 ipv_aEY
 }
 };
   GHC.Types.True -
 case eta_B3 of _ { Main.ConsM rb_dE3 rb1_dE4 -
 case GHC.Prim.readMutVar#
@ s_atH @ (Main.CircList s_atH) rb1_dE4 eta2_X15
 of _ { (# ipv_aEY, ipv1_aEZ #) -
 (# ipv_aEY, (ipv1_aEZ, eta1_X14) #)
 }
 }
 }
 end Rec }

 This destroys the effect, ie: it only produces the slow version. The
 iterateM as is, WITH StateT over ST WITH (+RTS -s -h) is the only way I
 have been able to produce the fast version.


 And to clarify the effect:
 1) I compile with -O2 -rtsopts.
 2) I run with +RTS -s
 3) Without recompiling or anything else I run +RTS -s -h

 #3 is faster than #2 or ANYTHING else, including no RTS, no -rtsopts.
 It still works with -O1
 I end up thrashing the system when I compile without an -Ox flag

-- 
Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7429#comment:3
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

___
Glasgow-haskell-bugs mailing list
Glasgow-haskell-bugs@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs