Re: Proposal: provide cas and barriers symbols even without -threaded
Edward, This makes sense to me. Especially because eliding-synchronization is already the convention followed in SMP.hs, where, for example, write_barrier becomes noops if !THREADED_RTS. All I would need would be linkable symbols for those noops (a la Inlines.chttps://github.com/ghc/ghc/blob/master/rts/Inlines.c), not just the #defines that are currently in SMP.h I think providing these symbols *reliably* would be complementary to Carter's proposal to handle them better in the LLVM backend. In fact, Carter's proposal is more motivation, for me to be using the official versions in my .cmm ccalls. Right now I've literally copy-pasted the relevant code from SMP.h, into C code called DUP_cas, DUP_write_barrier etc (yuck). And these duplicated versions would be missed by the CMM-LLVM conversion Carter has proposed. -Ryan On Thu, Jul 18, 2013 at 12:11 PM, Edward Z. Yang ezy...@mit.edu wrote: I want to note something, which is that if we did link in cas/store_load_barrier, then your lockfree queue would always be synchronized, even if you didn't compile with -threaded. Perhaps this is not a big deal, but it is generally nice to not pay the cost of synchronization when it is unnecessary. So it would be better if there were threaded/nonthreaded variants which you could use instead. How does that sound? (Fortunately, you are not inlining the functions, so it's totally possible for this to happen. We'd have a tougher row to hoe if you needed to inline these functions.) Edward Excerpts from Ryan Newton's message of Thu Jul 18 06:17:44 -0700 2013: The atomic-primops library depends on symbols such as store_load_barrier and cas, which are defined in SMP.h. Thus the result is that if the program is linked WITHOUT -threaded, the user gets a linker error about undefined symbols. The specific place it's used is in the 'foreign C' bits of this .cmm code: https://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c30b98c28c1d04f88781c/AtomicPrimops/cbits/primops.cmm I'm trying to explore hacks that will enable me to pull in those functions during compile time, without duplicating a whole bunch of code from the RTS. But it's a fragile business. It seems to me that some of these routines have general utility. In future versions of GHC, could we consider linking in those routines irrespective of -threaded? -Ryan ___ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs
Re: Proposal: provide cas and barriers symbols even without -threaded
Hi Carter, Yes, SMP.h is where I've copy pasted the duplicate functionality from (since I can't presently rely on linking the symbols). Your proposal for the LLVM backend sounds **great**. But it also is going to provide additional constraints for getting atomic-primops right. The goal of atomic-primops is to be a stable Haskell-level interface into the relevant CAS and fetch-and-add stuff. The reason this is important is that one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and have more native support in future GHC releases, where maybe the foreign primops would become unecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in upcoming 1.17.) A couple additional suggestions for the proposal in ticket #7883: - we should use more unique symbols than cas, especially for this rewriting trick. How about ghc_cas or something? - it would be great to get at least fetch-and-add in addition to CAS and barriers - if we reliably provide this set of special symbols, libraries like atomic-primops may use them in the .cmm and benefit from the CMM-LLVM substitutions - if we include all the primops I need in GHC proper the previous bullet will stop applying ;-) Cheers, -Ryan P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder about how close our pure values in a box approach comes to efficient lock-free structures. Well here are some numbers about using a proper unboxed counter in the Haskell heap, vs using an IORef Int and atomicModifyIORef': Up to 100X performance difference on some platforms for microbenchmarks that hammer a counter: https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops And here are the performance and scaling advantages of using ChaseLev (based on atomic-primops), over a traditional pure-in-a-box structure (IORef Data.Seq). The following are timings of ChaseLev/traditional respectively on a 32 core westmere: fib(42) 1 threads: 21s fib(42) 2 threads: 10.1s fib(42) 4 threads: 5.2s (100%prod) fib(42) 8 threads: 2.7s - 3.2s (100%prod) fib(42) 16 threads: 1.28s fib(42) 24 threads: 1.85s fib(42) 32 threads: 4.8s (high variance) (hive) fib(42) 1 threads: 41.8s (95% prod) (hive) fib(42) 2 threads: 25.2s (66% prod) (hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc) (hive) fib(42) 8 threads: 17.1s (26% prod) (hive) fib(42) 16 threads: 16.3s (13% prod) (hive) fib(42) 24 threads: 21.2s (30% prod) (hive) fib(42) 32 threads: 29.3s (33% prod) And that is WITH the inefficiency of doing a ccall on every single atomic operation. Notes on parfib performance are here: https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158 On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald carter.schonw...@gmail.com wrote: ryan, the relevant machinery on the C side is here, see ./includes/stg/SMP.h : https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h (unless i'm missing something) On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald carter.schonw...@gmail.com wrote: Ryan, if you look at line 270, you'll see the CAS is a C call https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L270 What Simon is alluding to is some work I started (but need to finish) http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and I'll need to sort out doing the same on the native code gen too there ARE no write barrier primops, they're baked into the CAS machinery in ghc's rts On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton rrnew...@gmail.com wrote: Yes, I'd absolutely rather not suffer C call overhead for these functions (or the CAS functions). But isn't that how it's done currently for the casMutVar# primop? https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L265 To avoid the overhead, is it necessary to make each primop in-line rather than out-of-line, or just to get rid of the ccall? Another reason it would be good to package these with GHC is that I'm having trouble building robust libraries of foreign primops that work under all ways (e.g. GHCI). For example, this bug: https://github.com/rrnewton/haskell-lockfree-queue/issues/10 If I write .cmm code that depends on RTS functionality like stg_MUT_VAR_CLEAN_info, then it seems to work fine when in compiled mode (with/without threading, profiling), but I get link errors from GHCI where these symbols aren't defined. I've got a draft of the relevant primops here:
Re: Proposal: provide cas and barriers symbols even without -threaded
Hi Simon, That sounds like a good solution and I'll attempt a patch. I think the fix is only three lines. That is, replace these three lines with EXTERN_INLINE C functions: #define write_barrier() /* nothing */ #define store_load_barrier() /* nothing */ #define load_load_barrier() /* nothing */ That would fix the -threaded/unthreaded disparity. But I still don't see how to access this stuff properly from foreign-primops in a library such that GHCI doesn't barf when trying to load the library -Ryan ___ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs
Re: Proposal: provide cas and barriers symbols even without -threaded
Ryan, could you explain what you want more precisely? Specifically what you want in terms of exposed primops using the terminology / vocabulary in http://llvm.org/docs/LangRef.html#ordering and http://llvm.org/docs/Atomics.html ? I'll first do the work for just the LLVM backend, and Ill likely need some active guidance / monitoring for the native codegen analogues (also asked this on ticket for documentation purposes) On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton rrnew...@gmail.com wrote: Hi Carter, Yes, SMP.h is where I've copy pasted the duplicate functionality from (since I can't presently rely on linking the symbols). Your proposal for the LLVM backend sounds **great**. But it also is going to provide additional constraints for getting atomic-primops right. The goal of atomic-primops is to be a stable Haskell-level interface into the relevant CAS and fetch-and-add stuff. The reason this is important is that one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and have more native support in future GHC releases, where maybe the foreign primops would become unecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in upcoming 1.17.) A couple additional suggestions for the proposal in ticket #7883: - we should use more unique symbols than cas, especially for this rewriting trick. How about ghc_cas or something? - it would be great to get at least fetch-and-add in addition to CAS and barriers - if we reliably provide this set of special symbols, libraries like atomic-primops may use them in the .cmm and benefit from the CMM-LLVM substitutions - if we include all the primops I need in GHC proper the previous bullet will stop applying ;-) Cheers, -Ryan P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder about how close our pure values in a box approach comes to efficient lock-free structures. Well here are some numbers about using a proper unboxed counter in the Haskell heap, vs using an IORef Int and atomicModifyIORef': Up to 100X performance difference on some platforms for microbenchmarks that hammer a counter: https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops And here are the performance and scaling advantages of using ChaseLev (based on atomic-primops), over a traditional pure-in-a-box structure (IORef Data.Seq). The following are timings of ChaseLev/traditional respectively on a 32 core westmere: fib(42) 1 threads: 21s fib(42) 2 threads: 10.1s fib(42) 4 threads: 5.2s (100%prod) fib(42) 8 threads: 2.7s - 3.2s (100%prod) fib(42) 16 threads: 1.28s fib(42) 24 threads: 1.85s fib(42) 32 threads: 4.8s (high variance) (hive) fib(42) 1 threads: 41.8s (95% prod) (hive) fib(42) 2 threads: 25.2s (66% prod) (hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc) (hive) fib(42) 8 threads: 17.1s (26% prod) (hive) fib(42) 16 threads: 16.3s (13% prod) (hive) fib(42) 24 threads: 21.2s (30% prod) (hive) fib(42) 32 threads: 29.3s (33% prod) And that is WITH the inefficiency of doing a ccall on every single atomic operation. Notes on parfib performance are here: https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158 On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald carter.schonw...@gmail.com wrote: ryan, the relevant machinery on the C side is here, see ./includes/stg/SMP.h : https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h (unless i'm missing something) On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald carter.schonw...@gmail.com wrote: Ryan, if you look at line 270, you'll see the CAS is a C call https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L270 What Simon is alluding to is some work I started (but need to finish) http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and I'll need to sort out doing the same on the native code gen too there ARE no write barrier primops, they're baked into the CAS machinery in ghc's rts On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton rrnew...@gmail.com wrote: Yes, I'd absolutely rather not suffer C call overhead for these functions (or the CAS functions). But isn't that how it's done currently for the casMutVar# primop? https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L265 To avoid the overhead, is it necessary to make each primop in-line rather than out-of-line, or just to get rid of the ccall? Another reason it would be good to package these
Re: Proposal: provide cas and barriers symbols even without -threaded
also: HOLY CRAP THATS AWESOME performance :) (i'll be wanting to do some cache aware parallel work stealing in the near future, so this is really really handy for me) On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton rrnew...@gmail.com wrote: Hi Carter, Yes, SMP.h is where I've copy pasted the duplicate functionality from (since I can't presently rely on linking the symbols). Your proposal for the LLVM backend sounds **great**. But it also is going to provide additional constraints for getting atomic-primops right. The goal of atomic-primops is to be a stable Haskell-level interface into the relevant CAS and fetch-and-add stuff. The reason this is important is that one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and have more native support in future GHC releases, where maybe the foreign primops would become unecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in upcoming 1.17.) A couple additional suggestions for the proposal in ticket #7883: - we should use more unique symbols than cas, especially for this rewriting trick. How about ghc_cas or something? - it would be great to get at least fetch-and-add in addition to CAS and barriers - if we reliably provide this set of special symbols, libraries like atomic-primops may use them in the .cmm and benefit from the CMM-LLVM substitutions - if we include all the primops I need in GHC proper the previous bullet will stop applying ;-) Cheers, -Ryan P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder about how close our pure values in a box approach comes to efficient lock-free structures. Well here are some numbers about using a proper unboxed counter in the Haskell heap, vs using an IORef Int and atomicModifyIORef': Up to 100X performance difference on some platforms for microbenchmarks that hammer a counter: https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops And here are the performance and scaling advantages of using ChaseLev (based on atomic-primops), over a traditional pure-in-a-box structure (IORef Data.Seq). The following are timings of ChaseLev/traditional respectively on a 32 core westmere: fib(42) 1 threads: 21s fib(42) 2 threads: 10.1s fib(42) 4 threads: 5.2s (100%prod) fib(42) 8 threads: 2.7s - 3.2s (100%prod) fib(42) 16 threads: 1.28s fib(42) 24 threads: 1.85s fib(42) 32 threads: 4.8s (high variance) (hive) fib(42) 1 threads: 41.8s (95% prod) (hive) fib(42) 2 threads: 25.2s (66% prod) (hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc) (hive) fib(42) 8 threads: 17.1s (26% prod) (hive) fib(42) 16 threads: 16.3s (13% prod) (hive) fib(42) 24 threads: 21.2s (30% prod) (hive) fib(42) 32 threads: 29.3s (33% prod) And that is WITH the inefficiency of doing a ccall on every single atomic operation. Notes on parfib performance are here: https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158 On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald carter.schonw...@gmail.com wrote: ryan, the relevant machinery on the C side is here, see ./includes/stg/SMP.h : https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h (unless i'm missing something) On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald carter.schonw...@gmail.com wrote: Ryan, if you look at line 270, you'll see the CAS is a C call https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L270 What Simon is alluding to is some work I started (but need to finish) http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and I'll need to sort out doing the same on the native code gen too there ARE no write barrier primops, they're baked into the CAS machinery in ghc's rts On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton rrnew...@gmail.com wrote: Yes, I'd absolutely rather not suffer C call overhead for these functions (or the CAS functions). But isn't that how it's done currently for the casMutVar# primop? https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts/PrimOps.cmm#L265 To avoid the overhead, is it necessary to make each primop in-line rather than out-of-line, or just to get rid of the ccall? Another reason it would be good to package these with GHC is that I'm having trouble building robust libraries of foreign primops that work under all ways (e.g. GHCI). For example, this bug: https://github.com/rrnewton/haskell-lockfree-queue/issues/10 If I write .cmm code that depends on RTS
Re: Proposal: provide cas and barriers symbols even without -threaded
Sorry, rewrite was too overloaded a term to use here. I was just referring to the proposal to substitute the cas funcall with the right llvm operation. That is, the approach would pattern match for the CMM code ccall cas or foreign C cas (I'm afraid I don't know the difference between those) and replace it with the equivalent LLVM op, right? I think the assumption there is that the native codegen would still have to suffer the funcall overhead and use the C versions. I don't know exactly what the changes would look like to make barriers/CAS all proper inline primops, because it would have to reproduce in the code generator all the platform-specific #ifdef'd C code that is currently in SMP.h. Which I guess is doable, but probably only for someone who knows the native GHC codegen properly... On Sat, Jul 20, 2013 at 2:30 AM, Carter Schonwald carter.schonw...@gmail.com wrote: Ryan, could you explain what you want more precisely? Specifically what you want in terms of exposed primops using the terminology / vocabulary in http://llvm.org/docs/LangRef.html#ordering and http://llvm.org/docs/Atomics.html ? I'll first do the work for just the LLVM backend, and Ill likely need some active guidance / monitoring for the native codegen analogues (also asked this on ticket for documentation purposes) On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton rrnew...@gmail.com wrote: Hi Carter, Yes, SMP.h is where I've copy pasted the duplicate functionality from (since I can't presently rely on linking the symbols). Your proposal for the LLVM backend sounds **great**. But it also is going to provide additional constraints for getting atomic-primops right. The goal of atomic-primops is to be a stable Haskell-level interface into the relevant CAS and fetch-and-add stuff. The reason this is important is that one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and have more native support in future GHC releases, where maybe the foreign primops would become unecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in upcoming 1.17.) A couple additional suggestions for the proposal in ticket #7883: - we should use more unique symbols than cas, especially for this rewriting trick. How about ghc_cas or something? - it would be great to get at least fetch-and-add in addition to CAS and barriers - if we reliably provide this set of special symbols, libraries like atomic-primops may use them in the .cmm and benefit from the CMM-LLVM substitutions - if we include all the primops I need in GHC proper the previous bullet will stop applying ;-) Cheers, -Ryan P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder about how close our pure values in a box approach comes to efficient lock-free structures. Well here are some numbers about using a proper unboxed counter in the Haskell heap, vs using an IORef Int and atomicModifyIORef': Up to 100X performance difference on some platforms for microbenchmarks that hammer a counter: https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f737af258848f279147ea24/AtomicPrimops/DEVLOG.md#20130718-timing-atomic-counter-ops And here are the performance and scaling advantages of using ChaseLev (based on atomic-primops), over a traditional pure-in-a-box structure (IORef Data.Seq). The following are timings of ChaseLev/traditional respectively on a 32 core westmere: fib(42) 1 threads: 21s fib(42) 2 threads: 10.1s fib(42) 4 threads: 5.2s (100%prod) fib(42) 8 threads: 2.7s - 3.2s (100%prod) fib(42) 16 threads: 1.28s fib(42) 24 threads: 1.85s fib(42) 32 threads: 4.8s (high variance) (hive) fib(42) 1 threads: 41.8s (95% prod) (hive) fib(42) 2 threads: 25.2s (66% prod) (hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc) (hive) fib(42) 8 threads: 17.1s (26% prod) (hive) fib(42) 16 threads: 16.3s (13% prod) (hive) fib(42) 24 threads: 21.2s (30% prod) (hive) fib(42) 32 threads: 29.3s (33% prod) And that is WITH the inefficiency of doing a ccall on every single atomic operation. Notes on parfib performance are here: https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055b1f51423954bb6b6bdfa/ChaseLev/Test.hs#L158 On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald carter.schonw...@gmail.com wrote: ryan, the relevant machinery on the C side is here, see ./includes/stg/SMP.h : https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/includes/stg/SMP.h (unless i'm missing something) On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald carter.schonw...@gmail.com wrote: Ryan, if you look at line 270, you'll see the CAS is a C call