Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Dave Tenny
As a person who has recently been dabbling with clojure for evaluation purposes I wondered if anybody wanted to post some links about parallel clojure apps that have been clear and easy parallelism wins for the types of applications that clojure was designed for. (To contrast the lengthy

Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread László Török
Hi, I believe Clojure's original mission has been giving you tools for handling concurrency[1] in your programs in a sane way. However, with the advent of Reducers[2], the landscape is changing quite a bit. If you're interested in the concurrency vs. parallelism terminology and what language

Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Michael Klishin
2013/11/6 Dave Tenny dave.te...@gmail.com (To contrast the lengthy discussion and analysis of this topic that is *hopefully* the exception and not the rule) Some of the comments reveal that part of the problem lies with the JVM memory allocator, which has its throughput limits. There are

Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Timothy Baldridge
You should also specify how many cores you plan on devoting to your application. Notice that most of this discussion has been about JVM apps running on machines with 32 cores. Systems like this aren't exactly common in my line of work (where we tend to run greater numbers of smaller servers using

Re: abysmal multicore performance, especially on AMD processors

2013-09-27 Thread Wm. Josiah Erikson
Interesting! If that is true of Java (I don't know Java at all), then your argument seems plausible. Cache-to-main-memory writes still take many more CPU cycles (an order of magnitude more, last I knew) than processor-to-cache. I don't think it's so much a bandwidth issue as latency, AFAIK. Thanks

Re: abysmal multicore performance, especially on AMD processors

2013-09-27 Thread Neale Swinnerton
The Disruptor project from LMAX has wrestled with these sorts of issues at length and achieved astounding levels of performance on the JVM. Martin Thompson, the original author of the Disruptor, is a leading light in the JVM performance space; his Mechanical Sympathy blog is a goldmine of

Re: abysmal multicore performance, especially on AMD processors

2013-09-26 Thread Andy Fingerhut
Adding to this thread from almost a year ago. I don't have conclusive proof with experiments to show right now, but I do have some experiments that have led me to what I think is a plausible cause of not just Clojure programs running more slowly when multi-threaded than when single-threaded, but

multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick
Keeping the discussion here would make sense, esp. in light of meetup.com's horrible discussion board. I don't have a lot to offer on the JVM/Clojure-specific problem beyond what I wrote in that meetup thread, but Lee's challenge(s) were too hard to resist: Would your conclusion be something

Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Marshall Bockrath-Vandegrift
Chas Emerick c...@cemerick.com writes: Keeping the discussion here would make sense, esp. in light of meetup.com's horrible discussion board. Excellent. Solves the problem of deciding the etiquette of jumping on the meetup board for a meetup one has never been involved in. :-) The nature

Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick
On Jan 31, 2013, at 9:23 AM, Marshall Bockrath-Vandegrift wrote: Chas Emerick c...@cemerick.com writes: The nature of the `burn` program is such that I'm skeptical of the ability of any garbage-collected runtime (lispy or not) to scale its operation across multiple threads. Bringing you

Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Lee Spector
On Jan 31, 2013, at 10:15 AM, Chas Emerick wrote: Then Wm. Josiah posted a full-application benchmark, which appears to have entirely different performance problems from the synthetic `burn` benchmark. I’d rejected GC as the cause for the slowdown there too, but ATM can’t recall why or

Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Marshall Bockrath-Vandegrift
Wm. Josiah Erikson wmjos...@gmail.com writes: Am I reading this right that this is actually a Java problem, and not clojure-specific? Wouldn't the rest of the Java community have noticed this? Or maybe massive parallelism in this particular way isn't something commonly done with Java in the

Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Lee Spector
FYI we had a bit of a discussion about this at a meetup in Amherst MA yesterday, and while I'm not sufficiently on top of the JVM or system issues to have briefed everyone on all of the details, there has been a little follow-up since the discussion, including results of some different

Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Andy Fingerhut
Josiah mentioned requesting a free trial of the Zing JVM. Did you ever get access to that, and were you able to try running your code on it? Again, I have no direct experience with their product to guarantee you better results -- just that I've heard good things about their ability to handle

Re: abysmal multicore performance, especially on AMD processors

2013-01-10 Thread Wm. Josiah Erikson
Am I reading this right that this is actually a Java problem, and not clojure-specific? Wouldn't the rest of the Java community have noticed this? Or maybe massive parallelism in this particular way isn't something commonly done with Java in the industry? Thanks for the patches though - it's nice

Re: abysmal multicore performance, especially on AMD processors

2012-12-30 Thread cameron
I've posted a patch with some changes here (https://gist.github.com/4416803). It includes the record change and a small change to interpret-instruction; with it the benchmark runs about 2x faster than the default, as it did for Marshall. The patch also modifies the main loop to use a thread pool instead of agents

Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread cameron
Hi Lee, I've done some more digging and seem to have found the root of the problem: Java native methods appear to be much slower when called in parallel. The following code illustrates the problem: (letfn [(time-native [f] (let [c (class [])] (time (dorun

Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread Leonardo Borges
In that case isn't context switching dominating your test? .isArray isn't expensive enough to warrant the use of pmap. Leonardo Borges www.leonardoborges.com On Dec 29, 2012 10:29 AM, cameron cdor...@gmail.com wrote: Hi Lee, I've done some more digging and seem to have found the root of the

Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread cameron
No, it's not the context switching: changing isArray (a native method) to getAnnotations (a normal JVM method) gives the same time for both the parallel and serial versions. Cameron. On Saturday, December 29, 2012 10:34:42 AM UTC+11, Leonardo Borges wrote: In that case isn't context switching
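A minimal sketch of the kind of serial-versus-parallel comparison discussed in these posts, offered as an illustration rather than as cameron's actual test code (which is truncated above). The helper names `call-many` and `elapsed-ms`, the task count, and the iteration count are all assumptions. Each task runs a tight loop of calls, so per-task work should outweigh `pmap`'s coordination overhead; no particular timings are claimed, and newer JVMs may treat these methods differently.

```clojure
;; Sketch only: call-many and elapsed-ms are hypothetical helper names,
;; and the task and iteration counts are arbitrary.
(defn call-many
  "Calls f on class c n times in a tight loop, discarding the results."
  [f c n]
  (dotimes [_ n] (f c)))

(defn elapsed-ms
  "Runs `tasks` independent call-many jobs through mapper (map or pmap)
  and returns the wall-clock time in milliseconds."
  [mapper f tasks n]
  (let [c     (class [])
        start (System/nanoTime)]
    (dorun (mapper (fn [_] (call-many f c n)) (range tasks)))
    (/ (- (System/nanoTime) start) 1e6)))

;; isArray is the method described as native in the thread;
;; getAnnotations is the ordinary JVM method used for comparison.
(def native-call (fn [^Class k] (.isArray k)))
(def normal-call (fn [^Class k] (.getAnnotations k)))

(comment
  ;; Example session: compare serial and parallel timings for both methods.
  (doseq [[label f] [["isArray" native-call] ["getAnnotations" normal-call]]]
    (println label
             "serial-ms"   (elapsed-ms map f 8 1000000)
             "parallel-ms" (elapsed-ms pmap f 8 1000000)))
  ;; pmap uses the agent thread pool; shut it down so the JVM can exit promptly.
  (shutdown-agents))
```

If the parallel time for one method is close to or worse than its serial time while the other method scales, that would be consistent with the observation reported above; identical behaviour for both would point elsewhere.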

Re: abysmal multicore performance, especially on AMD processors

2012-12-24 Thread cameron
I've been moving house for the last week or so but I'll also give the benchmark another look. My initial profiling seemed to show that the parallel version was spending a significant amount of time in java.lang.Class.isArray; clojush.pushstate/stack-ref is calling nth on the result of cons, since it

Re: abysmal multicore performance, especially on AMD processors

2012-12-22 Thread Lee Spector
On Dec 21, 2012, at 6:59 PM, Meikel Brandmeyer wrote: Is there a much simpler way that I overlooked? I'm not sure it's simpler, but it's more straightforward, I'd say. Thanks Marshall and Meikel for the struct-record conversion code. I'll definitely make a change along those lines.

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Marshall Bockrath-Vandegrift
Wm. Josiah Erikson wmjos...@gmail.com writes: I hope this helps people get to the bottom of things. Not to the bottom of things yet, but found some low-hanging fruit – switching the `push-state` from a struct-map to a record gives a flat ~2x speedup in all configurations I tested. So, that’s

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Lee Spector
On Dec 21, 2012, at 5:22 PM, Marshall Bockrath-Vandegrift wrote: Not to the bottom of things yet, but found some low-hanging fruit – switching the `push-state` from a struct-map to a record gives a flat ~2x speedup in all configurations I tested. So, that’s good? I really appreciate your

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes: FWIW I used records for push-states at one point but did not observe a speedup and it required much messier code, so I reverted to struct-maps. But maybe I wasn't doing the right timings. I'm curious about how you changed to records without the

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Meikel Brandmeyer
Hi, On 22.12.12 00:37, Lee Spector wrote: ;; this is defined elsewhere, and I want push-states to have fields for each push-type that's defined here (def push-types '(:exec :integer :float :code :boolean :string :zip :tag :auxiliary :return :environment)) (defn
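One way the struct-map to record conversion discussed here could be sketched: a macro that expands to a `defrecord` whose fields are taken from `push-types`. This is an illustration under assumptions, not necessarily the code Meikel or Marshall posted (their snippets are truncated above). The names `define-push-state-record`, `PushState`, and `make-push-state` are hypothetical.

```clojure
;; The list of stack types, as quoted in the thread.
(def push-types
  '(:exec :integer :float :code :boolean :string :zip :tag :auxiliary :return :environment))

;; Hypothetical helper: expands to (defrecord PushState [exec integer ...]).
;; push-types must already be defined when the macro is expanded.
(defmacro define-push-state-record
  []
  `(defrecord ~'PushState [~@(map (comp symbol name) push-types)]))

(define-push-state-record)

(defn make-push-state
  "Returns an empty push state: a PushState record with every field nil."
  []
  (map->PushState {}))

;; Records support the same keyword access and assoc as struct-maps.
(:integer (assoc (make-push-state) :integer '(1 2 3)))
```

Because records behave like maps for keyword lookup and `assoc`, code written against struct-maps can often stay unchanged; whether that holds for all of Clojush is not something this sketch establishes.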

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked it until I believe it shows the problem clearly and succinctly. I have put together a .tar.gz file that has everything needed to run it, except lein. Grab it here:

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
Whoops, sorry about the link. It can be found here: http://gibson.hampshire.edu/~josiah/clojush/ On Wed, Dec 19, 2012 at 11:57 AM, Wm. Josiah Erikson wmjos...@gmail.com wrote: So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Lee Spector
On Dec 19, 2012, at 11:57 AM, Wm. Josiah Erikson wrote: I think this is a succinct, deterministic benchmark that clearly demonstrates the problem and also doesn't use conj or reverse. Clarification: it's not just a tight loop involving reverse/conj, as our previous benchmark was. It's our

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
I tried redefining the few places in the code (string_reverse, I think) that used reverse to use the same version of reverse that I got such great speedups with in your code, and it made no difference. There are not any explicit calls to conj in the code that I could find. On Wed, Dec 19, 2012 at

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Tassilo Horn
Wm. Josiah Erikson wmjos...@gmail.com writes: Then run, for instance: /usr/bin/time -f %E lein run clojush.examples.benchmark-bowling and then, when that has finished, edit src/clojush/examples/benchmark_bowling.clj and uncomment :use-single-thread true and run it again. I think this is a

Re: abysmal multicore performance, especially on AMD processors

2012-12-16 Thread Lee Spector
On Dec 15, 2012, at 1:14 AM, cameron wrote: Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java for my GP work but for the last few years it's been GEVA with a clojure wrapper I wrote (https://github.com/cdorrat/geva-clj). Ah yes -- I've actually downloaded and

Re: abysmal multicore performance, especially on AMD processors

2012-12-16 Thread Lee Spector
On Dec 14, 2012, at 10:41 PM, cameron wrote: Until Lee has a representative benchmark for his application it's difficult to tell if he's experiencing the same problem but there would seem to be a case for changing the PersistentList implementation in clojure.lang. We put together a

Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread Herwig Hochleitner
I've created a test harness for this as a leiningen plugin: https://github.com/bendlas/lein-partest You can just put :plugins [[net.bendlas/lein-partest "0.1.0"]] into your project and run lein partest your.ns/testfn 6 to run 6 threads/processes in parallel. The plugin then runs the

Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread cameron
Thanks Herwig, I used your plugin with the following 2 burn variants: (defn burn-slow [ _] (count (last (take 1000 (iterate #(reduce conj '() %) (range 1)) (defn burn-fast [ _] (count (last (take 1000 (iterate #(reduce conj* (list nil) %) (range 1)) Where conj* is just a

Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread cameron
I'd be interested in seeing your GP system. The one we're using evolves Push programs and I suspect that whatever's triggering this problem with multicore utilization is stemming from something in the inner loop of my Push interpreter (https://github.com/lspector/Clojush)... but I don't

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
OK, I did something a little bit different, but I think it proves the same thing we were shooting for. On a 48-way 4 x Opteron 6168 with 32GB of RAM. This is Tom's Bowling benchmark: 1. multithreaded. Average of 10 runs: 14:00.9 2. singlethreaded. Average of 10 runs: 23:35.3 3. singlethreaded, 8

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
Ah. We'll look into running several clojures in one JVM too. Thanks. On Thu, Dec 13, 2012 at 1:41 PM, Wm. Josiah Erikson wmjos...@gmail.com wrote: OK, I did something a little bit different, but I think it proves the same thing we were shooting for. On a 48-way 4 x Opteron 6168 with 32GB of

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Andy Fingerhut
I'm not saying that I know this will help, but if you are open to trying a different JVM that has had a lot of work done on it to optimize it for high concurrency, Azul's Zing JVM may be worth a try, to see if it increases parallelism for a single Clojure instance in a single JVM, with lots of

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
Cool. I've requested a free trial. On Thu, Dec 13, 2012 at 1:53 PM, Andy Fingerhut andy.finger...@gmail.com wrote: I'm not saying that I know this will help, but if you are open to trying a different JVM that has had a lot of work done on it to optimize it for high concurrency, Azul's Zing JVM

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread cameron
On Friday, December 14, 2012 5:41:59 AM UTC+11, Wm. Josiah Erikson wrote: Does this help? Should I do something else as well? I'm curious to try running like, say 16 concurrent copies on the 48-way node Have you made any progress on a small deterministic benchmark that reflects your

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Lee Spector
On Dec 13, 2012, at 4:21 PM, cameron wrote: Have you made any progress on a small deterministic benchmark that reflects your application's behaviour (i.e. the RNG seed work you were discussing)? I'm keen to help, but I don't have time to look at benchmarks that take hours to run. I've

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread cameron
Hi Marshall, the megamorphic call site hypothesis does sound plausible but I'm not sure where the following test fits in. If I understand correctly we believe that it's the fact that the base case (a PersistentList$EmptyList instance) and the normal case (a PersistentList instance) have

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Marshall Bockrath-Vandegrift
Andy Fingerhut andy.finger...@gmail.com writes: I'm not practiced in recognizing megamorphic call sites, so I could be missing some in the example code below, modified from Lee's original code. It doesn't use reverse or conj, and as far as I can tell doesn't use PersistentList, either, only

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Marshall Bockrath-Vandegrift
cameron cdor...@gmail.com writes: the megamorphic call site hypothesis does sound plausible but I'm not sure where the following test fits in. ... I was toying with the idea of replacing the EmptyList class with a PersistentList instance to mitigate the problem in at least one common

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Andy Fingerhut
Lee: I believe you said that your benchmarking code achieved good speedup when run as separate JVMs that were each running a single thread, even before making the changes to the implementation of reverse found by Marshall. I confirmed that on my own machine as well. Have you tried

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Lee Spector
On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote: Have you tried running your real application in a single thread in a JVM, and then run multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Christophe Grand
Lee, while you are benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently. Christophe On Wed, Dec 12, 2012 at 4:11 PM, Lee Spector lspec...@hampshire.edu wrote: On Dec 12, 2012, at 10:03 AM,

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Lee Spector
On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote: Lee, while you are benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently. I'm not actually sure how to do that. We're starting runs with

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread cameron
On Thursday, December 13, 2012 12:51:57 AM UTC+11, Marshall Bockrath-Vandegrift wrote: cameron cdo...@gmail.com writes: the megamorphic call site hypothesis does sound plausible but I'm not sure where the following test fits in. ... I was toying with the idea of

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Christophe Grand
See https://github.com/flatland/classlojure for a nearly ready-made solution to running several Clojures in one JVM. On Wed, Dec 12, 2012 at 5:20 PM, Lee Spector lspec...@hampshire.edu wrote: On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote: Lee, while you are benchmarking, would

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
nicolas.o...@gmail.com nicolas.o...@gmail.com writes: What happens if you run it a third time at the end? (The question is related to the fact that there appear to be transition states between monomorphic and megamorphic call sites, which might lead to an explanation.) Same results, but

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector
On Dec 11, 2012, at 4:37 AM, Marshall Bockrath-Vandegrift wrote: I’m not sure what the next steps are. Open a bug on the JVM? This is something one can attempt to circumvent on a case-by-case basis, but IMHO has significant negative implications for Clojure’s concurrency story. I've gotten

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Gary Johnson
Lee, My reading of this thread is not quite as pessimistic as yours. Here is my synthesis for the practical application developer in Clojure from reading and re-reading all of the posts above. Marshall and Cameron, please feel free to correct me if I screw anything up here royally. ;-) When

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes: Is the following a fair characterization pending further developments? If you have a cons-intensive task then even if it can be divided into completely independent, long-running subtasks, there is currently no known way to get significant speedups

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector
On Dec 11, 2012, at 11:40 AM, Marshall Bockrath-Vandegrift wrote: Or have I missed a currently-available work-around among the many suggestions? You can specialize your application to avoid megamorphic call sites in tight loops. If you are working with `Cons`-order sequences, just use

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes: If the application does lots of list processing but does so with a mix of Clojure list and sequence manipulation functions, then one would have to write private, list/cons-only versions of all of these things? That is -- overstating it a bit, to be

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
OK WOW. You hit the nail on the head. It's reverse being called in a pmap that does it. When I redefine my own version of reverse (I totally cheated and just stole this) like this: (defn reverse-recursively [coll] (loop [[r & more :as all] (seq coll) acc '()] (if all (recur

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
And, interestingly enough, suddenly the AMD FX-8350 beats the Intel Core i7 3770K, when before it was very very much not so. So for some reason, this bug was tickled more dramatically on AMD multicore processors than on Intel ones. On Tue, Dec 11, 2012 at 2:54 PM, Wm. Josiah Erikson

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Andy Fingerhut
Marshall: I'm not practiced in recognizing megamorphic call sites, so I could be missing some in the example code below, modified from Lee's original code. It doesn't use reverse or conj, and as far as I can tell doesn't use PersistentList, either, only Cons. (defn burn-cons [size] (let

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
...and, suddenly, the high-core-count Opterons show us what we wanted and hoped for. If I increase that range statement to 100 and run it on the 48-core node, it takes 50 seconds (before it took 50 minutes), while the FX-8350 takes 3:31.89 and the 3770K takes 3:48.95. Thanks Marshall! I think you

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
Hm. Interesting. For the record, the exact code I'm running right now that I'm seeing great parallelism with is this: (defn reverse-recursively [coll] (loop [[r & more :as all] (seq coll) acc '()] (if all (recur more (cons r acc)) acc))) (defn burn ([] (loop [i 0
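The `reverse-recursively` function quoted in this post can be laid out and checked as follows. The definition follows the post (with the `&` in the destructuring form that the loop needs in order to walk the sequence); the comparison against `clojure.core/reverse` is an added illustration. Unlike `reverse`, which builds its result with `conj` on a list, this version builds it with `cons`, which is the difference the thread's hypothesis rests on.

```clojure
(defn reverse-recursively
  "Reverses coll by consing each element onto an accumulator.
  The accumulator starts as the empty list and is a clojure.lang.Cons
  after the first step."
  [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))

;; Same elements, same order as the built-in reverse.
(= (reverse-recursively (range 10)) (reverse (range 10)))
```

The loop terminates when `all` becomes nil, so collections containing nil or false elements are still reversed in full.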

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector
On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote: So I think if you replace your calls to `reverse` and any `conj` loops you have in your own code, you should see a perfectly reasonable speedup. Tantalizing, but on investigation I see that our real application actually does

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread cameron
Hi Marshall, I think we're definitely on the right track. If I replace the reverse call with the following function I get a parallel speedup of ~7.3 on an 8 core machine. (defn copy-to-java-list [coll] (let [lst (java.util.LinkedList.)] (doseq [x coll] (.addFirst lst x)) lst))
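The `copy-to-java-list` replacement quoted above, laid out with a small check. It sidesteps Clojure's persistent list classes entirely by pushing each element onto the front of a mutable `java.util.LinkedList`, so the result holds the elements in reverse order. The check at the end is an added illustration, not part of the original post.

```clojure
(defn copy-to-java-list
  "Copies coll into a java.util.LinkedList, adding each element at the
  front, so the resulting list is in reverse order."
  [coll]
  (let [lst (java.util.LinkedList.)]
    (doseq [x coll]
      (.addFirst lst x))
    lst))

;; The result is a mutable Java list, not a Clojure persistent list.
(vec (copy-to-java-list [1 2 3]))
```

Note that the returned list is mutable, so this is a diagnostic substitute for `reverse` in a benchmark rather than a drop-in replacement in code that relies on persistent collections.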

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marko Topolnik
The main GC feature here is the Thread-Local Allocation Buffers (TLABs). They are on by default and are automatically sized according to allocation patterns. The size can also be fine-tuned with the -XX:TLABSize=n configuration option. You may consider tweaking this setting to optimize runtime.

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marshall Bockrath-Vandegrift
cameron cdor...@gmail.com writes: There does seem to be something unusual about conj and clojure.lang.PersistentList in this parallel test case and I don't think it's related to the JVM's memory allocation. I’ve got a few more data-points, but still no handle on what exactly is going on. My

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread meteorfox
- Parallel allocation of `Cons` and `PersistentList` instances through a Clojure `conj` function remains fast as long as the function only ever returns objects of a single concrete type A possible explanation for this could be JIT Deoptimization. Deoptimization happens when the

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
Aha. Not only do I get a lot of "made not entrant", I get a lot of "made zombie". However, I get this for both runs with map and with pmap (and with pmapall as well). For instance, from a pmapall run: 33752 159 clojure.lang.Cons::next (10 bytes) made zombie 33752 164

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
I tried some more performance tuning options in Java, just for kicks, and didn't get any advantages from them: -server -XX:+TieredCompilation -XX:ReservedCodeCacheSize=256m Also, in case it's informative: [josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-1.txt | wc -l 173

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marshall Bockrath-Vandegrift
Wm. Josiah Erikson wmjos...@gmail.com writes: Aha. Not only do I get a lot of "made not entrant", I get a lot of "made zombie". However, I get this for both runs with map and with pmap (and with pmapall as well). I’m not sure this is all that enlightening. From what I can gather, “made not

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
Interesting. I tried the following: :jvm-opts ["-Xmx10g" "-Xms10g" "-XX:+AggressiveOpts" "-server" "-XX:+TieredCompilation" "-XX:ReservedCodeCacheSize=256m" "-XX:TLABSize=1G" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps" "-XX:+UseParNewGC" "-XX:+ResizeTLAB" "-XX:+UseTLAB"] I got a slight slowdown, and the GC details

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Jim - FooBar();
Hi Lee, Would it be difficult to try the following version of 'pmap'? It doesn't use futures but executors instead so at least this could help narrow the problem down... If the problem is due to the high number of futures spawned by pmap then this should fix it... (defn- with-thread-pool*
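A minimal sketch of an executor-based parallel map in the spirit of what is proposed here, not a reconstruction of Jim's `with-thread-pool*` (which is truncated above). The name `pool-map` and its argument order are assumptions. It submits one task per element to a fixed-size thread pool and collects the results in input order.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

;; Hypothetical helper: an eager, order-preserving parallel map on a
;; fixed thread pool instead of pmap's futures.
(defn pool-map
  [n-threads f coll]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)]
    (try
      (let [;; Clojure functions implement java.util.concurrent.Callable,
            ;; so zero-argument closures can be submitted directly.
            tasks   (mapv (fn [x] (fn [] (f x))) coll)
            ;; invokeAll blocks until every task has finished and returns
            ;; the futures in the same order as the tasks.
            futures (.invokeAll pool ^java.util.Collection tasks)]
        (mapv (fn [^Future fut] (.get fut)) futures))
      (finally
        (.shutdown pool)))))

(pool-map 4 inc [1 2 3])
```

Swapping `pmap` for something like this in a benchmark would help separate effects of `pmap`'s scheduling from effects inside the mapped function itself; it does nothing about the latter.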

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Marshall Bockrath-Vandegrift
cameron cdor...@gmail.com writes: Interesting problem, the slowdown seems to be caused by the reverse call (actually the calls to conj with a list argument). Excellent analysis, sir! I think this points things in the right direction. fast-reverse : map-ms: 3.3, pmap-ms 0.7, speedup

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Softaddicts
If the numbers of object allocations mentioned earlier in this thread are real, then yes, VM heap management can be a bottleneck. There has to be some locking done somewhere, otherwise the heap would get corrupted :) The other bottleneck can come from garbage collection, which has to freeze object allocation

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut
On Dec 8, 2012, at 9:37 PM, Lee Spector wrote: On Dec 8, 2012, at 10:19 PM, meteorfox wrote: Now if you run vmstat 1 while running your benchmark you'll notice that the run queue will be most of the time at 8, meaning that 8 processes are waiting for CPU, and this is due to memory

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut
On Dec 9, 2012, at 4:48 AM, Marshall Bockrath-Vandegrift wrote: It’s like there’s a lock of some sort sneaking in on the `conj` path. Any thoughts on what that could be? My current best guess is the JVM's memory allocator, not Clojure code. Andy

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Marshall Bockrath-Vandegrift
Andy Fingerhut andy.finger...@gmail.com writes: My current best guess is the JVM's memory allocator, not Clojure code. I didn’t mean to imply the problem was in Clojure itself, but I don’t believe the issue is in the memory allocator either. I now believe the problem is in a class of JIT

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut
On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: If the numbers of object allocations mentioned earlier in this thread are real, then yes, VM heap management can be a bottleneck. There has to be some locking done somewhere, otherwise the heap would get corrupted :) The other bottleneck can come from

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Softaddicts
There's no magic here; everyone tuning their app hits this wall eventually and ends up tweaking the JVM memory options :) Luc On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: If the numbers of object allocations mentioned earlier in this thread are real, then yes, VM heap management can be a bottleneck.

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Jim - FooBar();
Even though this is very surprising (and sad) to hear, I'm afraid I've got different experiences... My reducer-based parallel minimax is about 3x faster than the serial one on my 4-core AMD Phenom II, and a tiny bit faster on my girlfriend's Intel i5 (2 physical cores + 2 virtual). I'm

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes: I'm also aware that the test that produced the data I give below, insofar as it uses pmap to do the distribution, may leave cores idle for a bit if some tasks take a lot longer than others, because of the way that pmap allocates cores to threads.

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 7, 2012, at 9:42 PM, Andy Fingerhut wrote: When you say we can run multiple instances of the test on the same machine, do you mean that, for example, on an 8 core machine you run 8 different JVMs in parallel, each doing a single-threaded 'map' in your Clojure code and not a

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 9:36 AM, Marshall Bockrath-Vandegrift wrote: Although it doesn’t impact your benchmark, `pmap` may be further adversely affecting the performance of your actual program. There’s an open bug regarding `pmap` and chunked seqs: http://dev.clojure.org/jira/browse/CLJ-862

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Paul deGrandis
My experiences in the past are similar to the numbers that Jim is reporting. I have recently been centering most of my crunching code around reducers. Is it possible for you to cook up a small representative test using reducers+fork/join (and potentially primitives in the intermediate steps)?

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 1:28 PM, Paul deGrandis wrote: My experiences in the past are similar to the numbers that Jim is reporting. I have recently been centering most of my crunching code around reducers. Is it possible for you to cook up a small representative test using reducers+fork/join

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Andy: The short answer is yes, and we saw huge speedups. My latest post, as well as Lee's, has details. On Friday, December 7, 2012 9:42:03 PM UTC-5, Andy Fingerhut wrote: On Dec 7, 2012, at 5:25 PM, Lee Spector wrote: Another strange observation is that we can run multiple instances

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, and found some very strange results, which I shall post here, after I explain a little function that Lee wrote that is designed to get improved

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Andy Fingerhut
I haven't analyzed your results in detail, but here are some results I had on my 2GHz 4-core Intel core i7 MacBook Pro vintage 2011. When running multiple threads within a single JVM invocation, I never got a speedup of even 2. The highest speedup I measured was 1.82 speedup when I ran 8

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Andy Fingerhut
On Dec 7, 2012, at 5:25 PM, Lee Spector wrote: The test: I wrote a time-consuming function that just does a bunch of math and list manipulation (which is what takes a lot of time in my real applications): (defn burn ([] (loop [i 0 value '()] (if (= i 1)

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 3:42 PM, Andy Fingerhut wrote: I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing a linked list 10000 times, each time allocating 10000 cons cells (or Clojure's equivalent of a cons cell)? For a total of around 100,000,000 memory allocations
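
The allocation estimate quoted above can be checked with a little arithmetic. The following is only a sketch: the list length and iteration count of 10,000 are an assumption inferred from the quoted total of about 100,000,000 allocations, and `reverse-allocations` is a hypothetical helper, not code from the thread.

```clojure
;; Hypothetical helper: approximate number of cons cells allocated while
;; realizing (take n (iterate reverse lst)) when lst holds n elements.
;; The first item is lst itself (no copy); each of the remaining n-1 items
;; is a freshly built reversed list of n cells.
(defn reverse-allocations [n]
  (* (dec n) n))

;; Assumed size of 10,000, inferred from the quoted total.
(reverse-allocations 10000) ; => 99990000, i.e. "around 100,000,000"

;; Scaled-down demonstration that every step yields a full-length list.
(def lst (apply list (range 5)))
(def versions (take 5 (iterate reverse lst)))
(count (last versions)) ; => 5
```

Every one of those cells is short-lived garbage, which fits the suggestions elsewhere in the thread that memory allocation, not computation, is what limits the parallel speedup.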

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
I'm glad somebody else can duplicate our findings! I get results similar to this on Intel hardware. On AMD hardware, the disparity is bigger, and multiple threads of a single JVM invocation on AMD hardware consistently give me slowdowns as compared to a single thread. Also, your results are

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Marek Šrank
Just tried, my first foray into reducers, but I must not be understanding something correctly: (time (r/map burn (doall (range 4 returns in less than a second on my macbook pro, whereas (time (doall (map burn (range 4 takes nearly a minute. This feels like unforced

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Lee: So I ran On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote: I've been running compute intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines and blithely assuming that since I saw near maximal CPU utilization in top and the like that I was

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Lee: I ran Linux perf and also watched the run queue (with vmstat), and your bottleneck is basically memory access. The CPUs are idle 80% of the time due to stalled cycles. Here's what I got on my machine. Intel Core i7, 4 cores with Hyper-Threading (8 virtual processors), 16 GiB of memory, Oracle JVM

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Correction regarding the run queue: this is not completely correct :S. But the stalled cycles and memory accesses still hold. Sorry for the misinformation. On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote: I've been running compute intensive (multi-day), highly parallelizable

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote: Yep, reducers don't use lazy seqs. But they return just something like transformed functions that will be applied when building the collection. So you can use them like this: (into [] (r/map burn (doall (range 4) See
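
To illustrate the point about reducers not being lazy seqs, here is a minimal sketch. `work` is a hypothetical stand-in for the thread's `burn` function, and the input sizes and the partition size passed to `r/fold` are assumptions chosen for the example.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical stand-in for `burn`: any pure, CPU-bound function of one
;; argument serves for this illustration.
(defn work [n]
  (reduce + (map #(* % %) (range n))))

;; r/map performs no work here; it only returns a reducible "recipe",
;; which is why timing this call alone looks nearly instantaneous.
(def recipe (r/map work [1000 2000 3000 4000]))

;; Reducing the recipe is what actually calls `work` on each element.
;; `into` does this sequentially.
(def results (into [] recipe))

;; r/fold can process the same recipe in parallel via fork/join when the
;; source is a vector. The partition size of 1 is an assumption so that
;; even this tiny input is split into separate tasks (the default is 512).
(def total (r/fold 1 + + recipe))
```

A fair timing comparison against `(doall (map ...))` therefore has to wrap the `into` or `r/fold` call, not the `r/map` call.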

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 10:19 PM, meteorfox wrote: Now if you run vmstat 1 while running your benchmark you'll notice that the run queue will be most of the time at 8, meaning that 8 processes are waiting for CPU, and this is due to memory accesses (in this case, since this is not true for

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread cameron
Interesting problem: the slowdown seems to be caused by the reverse call (actually the calls to conj with a list argument). Calling conj in a multi-threaded environment seems to have a significant performance impact when using lists. I created some alternate reverse implementations (the

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread cameron
I forgot to mention: I cut the number of reverse iterations down to 1000 (not 10000) so I wouldn't have to wait too long for criterium; the speedup numbers are representative of the full test, though. Cameron. On Sunday, December 9, 2012 6:26:16 PM UTC+11, cameron wrote: Interesting

Re: abysmal multicore performance, especially on AMD processors

2012-12-07 Thread Andy Fingerhut
Lee: I'll just give a brief description right now, but one thing I've found in the past on a 2-core machine that was achieving much less than 2x speedup was memory bandwidth being the limiting factor. Not all Clojure code allocates memory, but a lot does. If the hardware in a system can

Re: abysmal multicore performance, especially on AMD processors

2012-12-07 Thread Lee Spector
Thanks Andy. My applications definitely allocate a lot of memory, which is reflected in all of that consing in the test I was using. It'd be hard to do what we do in any other way. I can see how a test using a Java mutable array would help to diagnose the problem, but if that IS the problem
