Re: abysmal multicore performance, especially on AMD processors

2013-11-05 Thread Wm. Josiah Erikson
http://mechanical-sympathy.blogspot.co.uk/ http://lmax-exchange.github.io/disruptor/ *Neale Swinnerton* {t: @sw1nn <https://twitter.com/#!/sw1nn>, w: sw1nn.com } On 27 September 2013 12:29, Wm. Josiah Erikson wrote: > Interesting! If that is true of Java (I don't know Java

Re: abysmal multicore performance, especially on AMD processors

2013-09-27 Thread Wm. Josiah Erikson
Interesting! If that is true of Java (I don't know Java at all), then your argument seems plausible. Cache-to-main-memory writes still take many more CPU cycles (an order of magnitude more, last I knew) than processor-to-cache. I don't think it's so much a bandwidth issue as latency, AFAIK. Thanks

Re: abysmal multicore performance, especially on AMD processors

2013-01-10 Thread Wm. Josiah Erikson
Am I reading this right that this is actually a Java problem, and not Clojure-specific? Wouldn't the rest of the Java community have noticed this? Or maybe massive parallelism in this particular way isn't something commonly done with Java in the industry? Thanks for the patches though - it's nice

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
1:00 PM, Lee Spector wrote: > > On Dec 19, 2012, at 11:57 AM, Wm. Josiah Erikson wrote: > > I think this is a succinct, deterministic benchmark that clearly > demonstrates the problem and also doesn't use conj or reverse. > > Clarification: it's not just a tight

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
Whoops, sorry about the link. It should be able to be found here: http://gibson.hampshire.edu/~josiah/clojush/ On Wed, Dec 19, 2012 at 11:57 AM, Wm. Josiah Erikson wrote: > So here's what we came up with that clearly demonstrates the problem. Lee > provided the code and I tweaked

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked it until I believe it shows the problem clearly and succinctly. I have put together a .tar.gz file that has everything needed to run it, except lein. Grab it here: clojush_bowling_benchmark.ta

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
ww.azulsystems.com/products/zing/whatisit > > Andy > > On Dec 13, 2012, at 10:41 AM, Wm. Josiah Erikson wrote: > > OK, I did something a little bit different, but I think it proves the same > thing we were shooting for. > > On a 48-way 4 x Opteron 6168 with 32GB of RAM. This

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
Ah. We'll look into running several clojures in one JVM too. Thanks. On Thu, Dec 13, 2012 at 1:41 PM, Wm. Josiah Erikson wrote: > OK, I did something a little bit different, but I think it proves the same > thing we were shooting for. > > On a 48-way 4 x Opteron 6168 with 32G

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
OK, I did something a little bit different, but I think it proves the same thing we were shooting for. On a 48-way 4 x Opteron 6168 with 32GB of RAM. This is Tom's "Bowling" benchmark: 1. multithreaded. Average of 10 runs: 14:00.9 2. singlethreaded. Average of 10 runs: 23:35.3 3. singlethreaded,
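To put the two complete timings above in perspective, here is a rough sketch of the speedup arithmetic. It assumes the figures are in minutes:seconds form; the helper name mmss->seconds is hypothetical and not from the thread, and the third (truncated) measurement is left out.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: convert a "mm:ss.s" string to seconds.
;; Assumes the timings quoted in the message are minutes:seconds.
(defn mmss->seconds [s]
  (let [[m sec] (str/split s #":")]
    (+ (* 60 (Long/parseLong m))
       (Double/parseDouble sec))))

(def multithreaded-secs  (mmss->seconds "14:00.9"))  ; run 1 in the message
(def singlethreaded-secs (mmss->seconds "23:35.3"))  ; run 2 in the message

;; Speedup of the multithreaded run over the singlethreaded run.
(def speedup (/ singlethreaded-secs multithreaded-secs))

;; Parallel efficiency on the 48-way machine described in the message.
(def efficiency (/ speedup 48))

(println "speedup:" speedup "efficiency:" efficiency)
```

Under that assumption the speedup comes out at roughly 1.68x, which on 48 cores is only a few percent of ideal scaling.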

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
Hm. Interesting. For the record, the exact code I'm running right now that I'm seeing great parallelism with is this: (defn reverse-recursively [coll] (loop [[r & more :as all] (seq coll) acc '()] (if all (recur more (cons r acc)) acc))) (defn burn ([] (loop [i 0
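Laid out on separate lines, the reverse-recursively definition quoted in the message above reads as follows. The pmap call around it is only an illustrative assumption (input sizes and the name results are made up), since the rest of the message, including burn, is cut off in this listing.

```clojure
;; Definition as quoted in the message, reformatted only.
(defn reverse-recursively [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))

;; Hypothetical usage: reverse several sequences in parallel with pmap,
;; using the hand-written function instead of clojure.core/reverse.
(def results
  (doall
    (pmap (fn [n] (reverse-recursively (range n)))
          [3 4 5])))

(println results)
```

This is a sketch for reading the snippet, not the benchmark itself; the actual workload used in the thread is not recoverable from the truncated text.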

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
may have just saved us outrageous quantities of time, though Lee isn't convinced that we know exactly what's going on with clojush yet, I don't think. We haven't looked at it yet though I'm sure we will soon enough! On Tue, Dec 11, 2012 at 2:57 PM, Wm. Jos

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
And, interestingly enough, suddenly the AMD FX-8350 beats the Intel Core i7 3770K, when before it was very very much not so. So for some reason, this bug was tickled more dramatically on AMD multicore processors than on Intel ones. On Tue, Dec 11, 2012 at 2:54 PM, Wm. Josiah Erikson wrote: >

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
OK WOW. You hit the nail on the head. It's "reverse" being called in a pmap that does it. When I redefine my own version of reverse (I totally cheated and just stole this) like this: (defn reverse-recursively [coll] (loop [[r & more :as all] (seq coll) acc '()] (if all (recur

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
90528K, 0% used [0x00065035, 0x00065035, 0x000650350200, 0x0007fae0) compacting perm gen total 21248K, used 11049K [0x0007fae0, 0x0007fc2c, 0x0008) the space 21248K, 52% used [0x0007fae0, 0x0007fb8ca638, 0x0007fb8

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
wc -l 158 [josiah@compute-1-17 benchmark]$ On Mon, Dec 10, 2012 at 12:55 PM, Wm. Josiah Erikson wrote: > Aha. Not only do I get a lot of "made not entrant", I get a lot of "made > zombie". However, I get this for both runs with map and with pmap (and with > p

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
Aha. Not only do I get a lot of "made not entrant", I get a lot of "made zombie". However, I get this for both runs with map and with pmap (and with pmapall as well) For instance, from a pmapall run: 33752 159 clojure.lang.Cons::next (10 bytes) made zombie 33752 164

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
> when allocating memory? Should JVM memory allocations be completely > parallel with no synchronization when running multiple threads, or do > memory allocations sometimes lock a shared data structure? > > Andy > > > On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Andy: The short answer is yes, and we saw huge speedups. My latest post, as well as Lee's, has details. On Friday, December 7, 2012 9:42:03 PM UTC-5, Andy Fingerhut wrote: > > > On Dec 7, 2012, at 5:25 PM, Lee Spector wrote: > > > > > Another strange observation is that we can run multiple inst

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, and found some very strange results, which I shall post here, after I explain a little function that Lee wrote that is designed to get improved r