Re: Proposal: provide cas and barriers symbols even without -threaded

2013-07-20 Thread Ryan Newton
Edward,

This makes sense to me.  Especially because eliding-synchronization is
already the convention followed in SMP.hs, where, for example,
write_barrier becomes noops if !THREADED_RTS.

All I would need would be linkable symbols for those noops (a la
Inlines.chttps://github.com/ghc/ghc/blob/master/rts/Inlines.c),
not just the #defines that are currently in SMP.h

I think providing these symbols *reliably* would be complementary to
Carter's proposal to handle them better in the LLVM backend.  In fact,
Carter's proposal is more motivation, for me to be using the official
versions in my .cmm ccalls.
   Right now I've literally copy-pasted the relevant code from SMP.h, into
C code called DUP_cas, DUP_write_barrier etc (yuck).  And these
duplicated versions would be missed by the CMM-LLVM conversion Carter has
proposed.

  -Ryan




On Thu, Jul 18, 2013 at 12:11 PM, Edward Z. Yang ezy...@mit.edu wrote:

 I want to note something, which is that if we did link in
 cas/store_load_barrier, then your lockfree queue would always be
 synchronized, even if you didn't compile with -threaded.  Perhaps this
 is not a big deal, but it is generally nice to not pay the cost of
 synchronization when it is unnecessary.  So it would be better if there
 were threaded/nonthreaded variants which you could use instead.  How
 does that sound? (Fortunately, you are not inlining the functions, so
 it's totally possible for this to happen.  We'd have a tougher row
 to hoe if you needed to inline these functions.)

 Edward

 Excerpts from Ryan Newton's message of Thu Jul 18 06:17:44 -0700 2013:
  The atomic-primops library depends on symbols such as
 store_load_barrier
  and cas, which are defined in SMP.h.  Thus the result is that if the
  program is linked WITHOUT -threaded, the user gets a linker error about
  undefined symbols.
 
  The specific place it's used is in the 'foreign C' bits of this .cmm
 code:
 
 
 
 https://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c30b98c28c1d04f88781c/AtomicPrimops/cbits/primops.cmm
 
  I'm trying to explore hacks that will enable me to pull in those
 functions
  during compile time, without duplicating a whole bunch of code from the
  RTS.  But it's a fragile business.
 
  It seems to me that some of these routines have general utility.  In
 future
  versions of GHC, could we consider linking in those routines irrespective
  of -threaded?
 
-Ryan

___
ghc-devs mailing list
ghc-devs@haskell.org
http://www.haskell.org/mailman/listinfo/ghc-devs


Re: Proposal: provide cas and barriers symbols even without -threaded

2013-07-20 Thread Ryan Newton
Hi Carter,

Yes, SMP.h is where I've copy pasted the duplicate functionality from
(since I can't presently rely on linking the symbols).

Your proposal for the LLVM backend sounds **great**.  But it also is going
to provide additional constraints for getting atomic-primops right.
   The goal of atomic-primops is to be a stable Haskell-level interface
into the relevant CAS and fetch-and-add stuff.  The reason this is
important is that one has to be very careful to defeat the GHC optimizer in
all the relevant places and make pointer equality a reliable property.  I
would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and
have more native support in future GHC releases, where maybe the foreign
primops would become unecessary.  (They are a pain and have already exposed
one blocking cabal bug, fixed in upcoming 1.17.)

A couple additional suggestions for the proposal in ticket #7883:

   - we should use more unique symbols than cas, especially for this
   rewriting trick.  How about ghc_cas or something?
   - it would be great to get at least fetch-and-add in addition to CAS and
   barriers
   - if we reliably provide this set of special symbols, libraries like
   atomic-primops may use them in the .cmm and benefit from the CMM-LLVM
   substitutions
   - if we include all the primops I need in GHC proper the previous bullet
   will stop applying ;-)

Cheers,
  -Ryan

P.S. Just as a bit of motivation, here are some recent performance numbers.
 We often wonder about how close our pure values in a box approach comes
to efficient lock-free structures.  Well here are some numbers about using
a proper unboxed counter in the Haskell heap, vs using an IORef Int and
atomicModifyIORef':  Up to 100X performance difference on some platforms
for microbenchmarks that hammer a counter:


https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops

And here are the performance and scaling advantages of using ChaseLev
(based on atomic-primops), over a traditional pure-in-a-box structure
(IORef Data.Seq). The following are timings of ChaseLev/traditional
respectively on a 32 core westmere:

fib(42) 1 threads:  21s
fib(42) 2 threads:  10.1s
fib(42) 4 threads:  5.2s (100%prod)
fib(42) 8 threads:  2.7s - 3.2s (100%prod)
fib(42) 16 threads: 1.28s
fib(42) 24 threads: 1.85s
fib(42) 32 threads: 4.8s (high variance)

(hive) fib(42) 1 threads:  41.8s  (95% prod)
(hive) fib(42) 2 threads:  25.2s  (66% prod)
(hive) fib(42) 4 threads:  14.6s  (27% prod, 135GB alloc)
(hive) fib(42) 8 threads:  17.1s  (26% prod)
(hive) fib(42) 16 threads: 16.3s  (13% prod)
(hive) fib(42) 24 threads: 21.2s  (30% prod)
(hive) fib(42) 32 threads: 29.3s  (33% prod)

And that is WITH the inefficiency of doing a ccall on every single atomic
operation.

Notes on parfib performance are here:

https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158







On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald 
carter.schonw...@gmail.com wrote:

 ryan, the relevant machinery on the C side is here, see
 ./includes/stg/SMP.h :
 https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h

 (unless i'm missing something)


 On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald 
 carter.schonw...@gmail.com wrote:

 Ryan,
 if you look at line 270, you'll see the CAS is a C call
 https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L270


 What Simon is alluding to is some work I started (but need to finish)
 http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and
 I'll need to sort out doing the same on the native code gen too

 there ARE no write barrier primops, they're baked into the CAS machinery
 in ghc's rts


 On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton rrnew...@gmail.com wrote:

 Yes, I'd absolutely rather not suffer C call overhead for these
 functions (or the CAS functions).  But isn't that how it's done currently
 for the casMutVar# primop?


 https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L265

 To avoid the overhead, is it necessary to make each primop in-line
 rather than out-of-line, or just to get rid of the ccall?

 Another reason it would be good to package these with GHC is that I'm
 having trouble building robust libraries of foreign primops that work under
 all ways (e.g. GHCI).  For example, this bug:

 https://github.com/rrnewton/haskell-lockfree-queue/issues/10

 If I write .cmm code that depends on RTS functionality like
 stg_MUT_VAR_CLEAN_info, then it seems to work fine when in compiled mode
 (with/without threading, profiling), but I get link errors from GHCI where
 these symbols aren't defined.

 I've got a draft of the relevant primops here:


 

Re: Proposal: provide cas and barriers symbols even without -threaded

2013-07-20 Thread Ryan Newton
Hi Simon,

That sounds like a good solution and I'll attempt a patch.  I think the fix
is only three lines.  That is, replace these three lines with EXTERN_INLINE
C functions:

#define write_barrier()  /* nothing */
#define store_load_barrier() /* nothing */
#define load_load_barrier()  /* nothing */

That would fix the -threaded/unthreaded disparity.  But I still don't see
how to access this stuff properly from foreign-primops in a library such
that GHCI doesn't barf when trying to load the library

  -Ryan
___
ghc-devs mailing list
ghc-devs@haskell.org
http://www.haskell.org/mailman/listinfo/ghc-devs


Re: Proposal: provide cas and barriers symbols even without -threaded

2013-07-20 Thread Carter Schonwald
Ryan, could you explain what you want more precisely? Specifically what you
want in terms of exposed primops using the terminology / vocabulary in
http://llvm.org/docs/LangRef.html#ordering and
http://llvm.org/docs/Atomics.html ?

 I'll first do the work for just the LLVM backend, and Ill likely need
some active guidance / monitoring for the native codegen analogues

(also asked this on ticket for documentation purposes)


On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton rrnew...@gmail.com wrote:

 Hi Carter,

 Yes, SMP.h is where I've copy pasted the duplicate functionality from
 (since I can't presently rely on linking the symbols).

 Your proposal for the LLVM backend sounds **great**.  But it also is
 going to provide additional constraints for getting atomic-primops right.

The goal of atomic-primops is to be a stable Haskell-level interface
 into the relevant CAS and fetch-and-add stuff.  The reason this is
 important is that one has to be very careful to defeat the GHC optimizer in
 all the relevant places and make pointer equality a reliable property.  I
 would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and
 have more native support in future GHC releases, where maybe the foreign
 primops would become unecessary.  (They are a pain and have already exposed
 one blocking cabal bug, fixed in upcoming 1.17.)

 A couple additional suggestions for the proposal in ticket #7883:

- we should use more unique symbols than cas, especially for this
rewriting trick.  How about ghc_cas or something?
- it would be great to get at least fetch-and-add in addition to CAS
and barriers
- if we reliably provide this set of special symbols, libraries like
atomic-primops may use them in the .cmm and benefit from the CMM-LLVM
substitutions
- if we include all the primops I need in GHC proper the previous
bullet will stop applying ;-)

 Cheers,
   -Ryan

 P.S. Just as a bit of motivation, here are some recent performance
 numbers.  We often wonder about how close our pure values in a box
 approach comes to efficient lock-free structures.  Well here are some
 numbers about using a proper unboxed counter in the Haskell heap, vs using
 an IORef Int and atomicModifyIORef':  Up to 100X performance difference
 on some platforms for microbenchmarks that hammer a counter:


 https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops

 And here are the performance and scaling advantages of using ChaseLev
 (based on atomic-primops), over a traditional pure-in-a-box structure
 (IORef Data.Seq). The following are timings of ChaseLev/traditional
 respectively on a 32 core westmere:

 fib(42) 1 threads:  21s
 fib(42) 2 threads:  10.1s
 fib(42) 4 threads:  5.2s (100%prod)
 fib(42) 8 threads:  2.7s - 3.2s (100%prod)
 fib(42) 16 threads: 1.28s
 fib(42) 24 threads: 1.85s
 fib(42) 32 threads: 4.8s (high variance)

 (hive) fib(42) 1 threads:  41.8s  (95% prod)
 (hive) fib(42) 2 threads:  25.2s  (66% prod)
 (hive) fib(42) 4 threads:  14.6s  (27% prod, 135GB alloc)
 (hive) fib(42) 8 threads:  17.1s  (26% prod)
 (hive) fib(42) 16 threads: 16.3s  (13% prod)
 (hive) fib(42) 24 threads: 21.2s  (30% prod)
 (hive) fib(42) 32 threads: 29.3s  (33% prod)

 And that is WITH the inefficiency of doing a ccall on every single
 atomic operation.

 Notes on parfib performance are here:


 https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158







 On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald 
 carter.schonw...@gmail.com wrote:

 ryan, the relevant machinery on the C side is here, see
 ./includes/stg/SMP.h :
 https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h

 (unless i'm missing something)


 On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald 
 carter.schonw...@gmail.com wrote:

 Ryan,
 if you look at line 270, you'll see the CAS is a C call
 https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L270


 What Simon is alluding to is some work I started (but need to finish)
 http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and
 I'll need to sort out doing the same on the native code gen too

 there ARE no write barrier primops, they're baked into the CAS machinery
 in ghc's rts


 On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton rrnew...@gmail.com wrote:

 Yes, I'd absolutely rather not suffer C call overhead for these
 functions (or the CAS functions).  But isn't that how it's done currently
 for the casMutVar# primop?


 https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L265

 To avoid the overhead, is it necessary to make each primop in-line
 rather than out-of-line, or just to get rid of the ccall?

 Another reason it would be good to package these 

Re: Proposal: provide cas and barriers symbols even without -threaded

2013-07-20 Thread Carter Schonwald
also: HOLY CRAP THATS AWESOME  performance :)

(i'll be wanting to do some cache aware parallel work stealing in the near
future, so this is really really handy for me)


On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton rrnew...@gmail.com wrote:

 Hi Carter,

 Yes, SMP.h is where I've copy pasted the duplicate functionality from
 (since I can't presently rely on linking the symbols).

 Your proposal for the LLVM backend sounds **great**.  But it also is
 going to provide additional constraints for getting atomic-primops right.

The goal of atomic-primops is to be a stable Haskell-level interface
 into the relevant CAS and fetch-and-add stuff.  The reason this is
 important is that one has to be very careful to defeat the GHC optimizer in
 all the relevant places and make pointer equality a reliable property.  I
 would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and
 have more native support in future GHC releases, where maybe the foreign
 primops would become unecessary.  (They are a pain and have already exposed
 one blocking cabal bug, fixed in upcoming 1.17.)

 A couple additional suggestions for the proposal in ticket #7883:

- we should use more unique symbols than cas, especially for this
rewriting trick.  How about ghc_cas or something?
- it would be great to get at least fetch-and-add in addition to CAS
and barriers
- if we reliably provide this set of special symbols, libraries like
atomic-primops may use them in the .cmm and benefit from the CMM-LLVM
substitutions
- if we include all the primops I need in GHC proper the previous
bullet will stop applying ;-)

 Cheers,
   -Ryan

 P.S. Just as a bit of motivation, here are some recent performance
 numbers.  We often wonder about how close our pure values in a box
 approach comes to efficient lock-free structures.  Well here are some
 numbers about using a proper unboxed counter in the Haskell heap, vs using
 an IORef Int and atomicModifyIORef':  Up to 100X performance difference
 on some platforms for microbenchmarks that hammer a counter:


 https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops

 And here are the performance and scaling advantages of using ChaseLev
 (based on atomic-primops), over a traditional pure-in-a-box structure
 (IORef Data.Seq). The following are timings of ChaseLev/traditional
 respectively on a 32 core westmere:

 fib(42) 1 threads:  21s
 fib(42) 2 threads:  10.1s
 fib(42) 4 threads:  5.2s (100%prod)
 fib(42) 8 threads:  2.7s - 3.2s (100%prod)
 fib(42) 16 threads: 1.28s
 fib(42) 24 threads: 1.85s
 fib(42) 32 threads: 4.8s (high variance)

 (hive) fib(42) 1 threads:  41.8s  (95% prod)
 (hive) fib(42) 2 threads:  25.2s  (66% prod)
 (hive) fib(42) 4 threads:  14.6s  (27% prod, 135GB alloc)
 (hive) fib(42) 8 threads:  17.1s  (26% prod)
 (hive) fib(42) 16 threads: 16.3s  (13% prod)
 (hive) fib(42) 24 threads: 21.2s  (30% prod)
 (hive) fib(42) 32 threads: 29.3s  (33% prod)

 And that is WITH the inefficiency of doing a ccall on every single
 atomic operation.

 Notes on parfib performance are here:


 https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158







 On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald 
 carter.schonw...@gmail.com wrote:

 ryan, the relevant machinery on the C side is here, see
 ./includes/stg/SMP.h :
 https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h

 (unless i'm missing something)


 On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald 
 carter.schonw...@gmail.com wrote:

 Ryan,
 if you look at line 270, you'll see the CAS is a C call
 https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L270


 What Simon is alluding to is some work I started (but need to finish)
 http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and
 I'll need to sort out doing the same on the native code gen too

 there ARE no write barrier primops, they're baked into the CAS machinery
 in ghc's rts


 On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton rrnew...@gmail.com wrote:

 Yes, I'd absolutely rather not suffer C call overhead for these
 functions (or the CAS functions).  But isn't that how it's done currently
 for the casMutVar# primop?


 https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L265

 To avoid the overhead, is it necessary to make each primop in-line
 rather than out-of-line, or just to get rid of the ccall?

 Another reason it would be good to package these with GHC is that I'm
 having trouble building robust libraries of foreign primops that work under
 all ways (e.g. GHCI).  For example, this bug:

 https://github.com/rrnewton/haskell-lockfree-queue/issues/10

 If I write .cmm code that depends on RTS 

Re: Proposal: provide cas and barriers symbols even without -threaded

2013-07-20 Thread Ryan Newton
Sorry, rewrite was too overloaded a term to use here.  I was just
referring to the proposal to substitute the cas funcall with the right
llvm operation.

That is, the approach would pattern match for the CMM code ccall cas or
foreign C cas (I'm afraid I don't know the difference between those)
and replace it with the equivalent LLVM op, right?

I think the assumption there is that the native codegen would still have to
suffer the funcall overhead and use the C versions.  I don't know exactly
what the changes would look like to make barriers/CAS all proper inline
primops, because it would have to reproduce in the code generator all the
platform-specific #ifdef'd C code that is currently in SMP.h.  Which I
guess is doable, but probably only for someone who knows the native GHC
codegen properly...



On Sat, Jul 20, 2013 at 2:30 AM, Carter Schonwald 
carter.schonw...@gmail.com wrote:

 Ryan, could you explain what you want more precisely? Specifically what
 you want in terms of exposed primops using the terminology / vocabulary in
 http://llvm.org/docs/LangRef.html#ordering and
 http://llvm.org/docs/Atomics.html ?

  I'll first do the work for just the LLVM backend, and Ill likely need
 some active guidance / monitoring for the native codegen analogues

 (also asked this on ticket for documentation purposes)


 On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton rrnew...@gmail.com wrote:

 Hi Carter,

 Yes, SMP.h is where I've copy pasted the duplicate functionality from
 (since I can't presently rely on linking the symbols).

 Your proposal for the LLVM backend sounds **great**.  But it also is
 going to provide additional constraints for getting atomic-primops right.

The goal of atomic-primops is to be a stable Haskell-level interface
 into the relevant CAS and fetch-and-add stuff.  The reason this is
 important is that one has to be very careful to defeat the GHC optimizer in
 all the relevant places and make pointer equality a reliable property.  I
 would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and
 have more native support in future GHC releases, where maybe the foreign
 primops would become unecessary.  (They are a pain and have already exposed
 one blocking cabal bug, fixed in upcoming 1.17.)

 A couple additional suggestions for the proposal in ticket #7883:

- we should use more unique symbols than cas, especially for this
rewriting trick.  How about ghc_cas or something?
- it would be great to get at least fetch-and-add in addition to CAS
and barriers
- if we reliably provide this set of special symbols, libraries like
atomic-primops may use them in the .cmm and benefit from the CMM-LLVM
substitutions
- if we include all the primops I need in GHC proper the previous
bullet will stop applying ;-)

 Cheers,
   -Ryan

 P.S. Just as a bit of motivation, here are some recent performance
 numbers.  We often wonder about how close our pure values in a box
 approach comes to efficient lock-free structures.  Well here are some
 numbers about using a proper unboxed counter in the Haskell heap, vs using
 an IORef Int and atomicModifyIORef':  Up to 100X performance difference
 on some platforms for microbenchmarks that hammer a counter:


 https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops

 And here are the performance and scaling advantages of using ChaseLev
 (based on atomic-primops), over a traditional pure-in-a-box structure
 (IORef Data.Seq). The following are timings of ChaseLev/traditional
 respectively on a 32 core westmere:

 fib(42) 1 threads:  21s
 fib(42) 2 threads:  10.1s
 fib(42) 4 threads:  5.2s (100%prod)
 fib(42) 8 threads:  2.7s - 3.2s (100%prod)
 fib(42) 16 threads: 1.28s
 fib(42) 24 threads: 1.85s
 fib(42) 32 threads: 4.8s (high variance)

 (hive) fib(42) 1 threads:  41.8s  (95% prod)
 (hive) fib(42) 2 threads:  25.2s  (66% prod)
 (hive) fib(42) 4 threads:  14.6s  (27% prod, 135GB alloc)
 (hive) fib(42) 8 threads:  17.1s  (26% prod)
 (hive) fib(42) 16 threads: 16.3s  (13% prod)
 (hive) fib(42) 24 threads: 21.2s  (30% prod)
 (hive) fib(42) 32 threads: 29.3s  (33% prod)

 And that is WITH the inefficiency of doing a ccall on every single
 atomic operation.

 Notes on parfib performance are here:


 https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158







 On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald 
 carter.schonw...@gmail.com wrote:

 ryan, the relevant machinery on the C side is here, see
 ./includes/stg/SMP.h :
 https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h

 (unless i'm missing something)


 On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald 
 carter.schonw...@gmail.com wrote:

 Ryan,
 if you look at line 270, you'll see the CAS is a C call