As a person who has recently been dabbling with clojure for evaluation
purposes, I wondered if anybody wanted to post some links about parallel
clojure apps that have been clear and easy parallelism wins for the types
of applications that clojure was designed for. (To contrast the lengthy
Hi,
I believe Clojure's original mission was to give you tools for handling
concurrency[1] in your programs in a sane way.
However, with the advent of Reducers[2], the landscape is changing quite a
bit.
If you're interested in the concurrency vs. parallelism terminology and
what language
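The Reducers reference above can be made concrete with a small sketch (my example, not from the thread): r/fold partitions a vector and reduces the partitions in parallel via fork/join.

```clojure
(require '[clojure.core.reducers :as r])

;; r/fold reduces partitions of a vector in parallel (fork/join) and
;; combines them with +; r/map here is a non-lazy transformation.
(defn sum-of-squares [v]
  (r/fold + (r/map #(* % %) v)))

(sum-of-squares (vec (range 10)))  ; => 285
```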
2013/11/6 Dave Tenny dave.te...@gmail.com
(To contrast the lengthy discussion and analysis of this topic that is
*hopefully* the exception and not the rule)
Some of the comments reveal that part of the problem lies with the JVM
memory allocator, which has throughput limits.
There are
You should also specify how many cores you plan on devoting to your
application. Notice that most of this discussion has been about JVM apps
running on machines with 32 cores. Systems like this aren't exactly common
in my line of work (where we tend to run greater numbers of smaller servers
using
Interesting! If that is true of Java (I don't know Java at all), then your
argument seems plausible. Cache-to-main-memory writes still take many more
CPU cycles (an order of magnitude more, last I knew) than
processor-to-cache. I don't think it's so much a bandwidth issue as
latency, AFAIK. Thanks
The disruptor project from LMAX has wrestled with these sorts of issues at
length and achieved astounding levels of performance on the JVM.
Martin Thompson, the original author of the disruptor, is a leading light
in the JVM performance space; his Mechanical Sympathy blog is a goldmine of
Adding to this thread from almost a year ago. I don't have conclusive
proof with experiments to show right now, but I do have some experiments
that have led me to what I think is a plausible cause of not just Clojure
programs running more slowly when multi-threaded than when single-threaded,
but
Keeping the discussion here would make sense, esp. in light of meetup.com's
horrible discussion board.
I don't have a lot to offer on the JVM/Clojure-specific problem beyond what I
wrote in that meetup thread, but Lee's challenge(s) were too hard to resist:
Would your conclusion be something
Chas Emerick c...@cemerick.com writes:
Keeping the discussion here would make sense, esp. in light of
meetup.com's horrible discussion board.
Excellent. Solves the problem of deciding the etiquette of jumping on
the meetup board for a meetup one has never been involved in. :-)
The nature
On Jan 31, 2013, at 9:23 AM, Marshall Bockrath-Vandegrift wrote:
Chas Emerick c...@cemerick.com writes:
The nature of the `burn` program is such that I'm skeptical of the
ability of any garbage-collected runtime (lispy or not) to scale its
operation across multiple threads.
Bringing you
On Jan 31, 2013, at 10:15 AM, Chas Emerick wrote:
Then Wm. Josiah posted a full-application benchmark, which appears to
have entirely different performance problems from the synthetic `burn`
benchmark. I’d rejected GC as the cause for the slowdown there too, but
ATM can’t recall why or
Wm. Josiah Erikson wmjos...@gmail.com writes:
Am I reading this right that this is actually a Java problem, and not
clojure-specific? Wouldn't the rest of the Java community have noticed
this? Or maybe massive parallelism in this particular way isn't
something commonly done with Java in the
FYI we had a bit of a discussion about this at a meetup in Amherst MA
yesterday, and while I'm not sufficiently on top of the JVM or system issues to
have briefed everyone on all of the details, there has been a little followup
since the discussion, including results of some different
Josiah mentioned requesting a free trial of the Zing JVM. Did you ever get
access to that, and were you able to try your code running on it?
Again, I have no direct experience with their product to guarantee you better
results -- just that I've heard good things about their ability to handle
Am I reading this right that this is actually a Java problem, and not
clojure-specific? Wouldn't the rest of the Java community have noticed
this? Or maybe massive parallelism in this particular way isn't something
commonly done with Java in the industry?
Thanks for the patches though - it's nice
I've posted a patch with some changes here
(https://gist.github.com/4416803); it includes the record change here and
a small change to interpret-instruction. The benchmark runs at 2x the
default speed, as it did for Marshall.
The patch also modifies the main loop to use a thread pool instead of
agents
Hi Lee,
I've done some more digging and seem to have found the root of the
problem,
it seems that java native methods are much slower when called in parallel.
The following code illustrates the problem:
(letfn [(time-native [f]
(let [c (class [])]
(time (dorun
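Cameron's snippet is cut off above; the following is my reconstruction of the kind of test he describes (the body, names, and counts are guesses, not his exact code): timing calls to a native method (Class/isArray) serially versus in parallel.

```clojure
;; Time many calls to a native method through either map or pmap.
;; The iteration count is illustrative only.
(defn time-native [map-fn]
  (let [^Class c (class [])]
    (time (dorun (map-fn (fn [_] (.isArray c)) (range 100000))))))

;; (time-native map)   ; serial
;; (time-native pmap)  ; parallel — reportedly no faster, sometimes slower
```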
In that case isn't context switching dominating your test?
.isArray isn't expensive enough to warrant the use of pmap
Leonardo Borges
www.leonardoborges.com
On Dec 29, 2012 10:29 AM, cameron cdor...@gmail.com wrote:
Hi Lee,
I've done some more digging and seem to have found the root of the
No, it's not the context switching: changing isArray (a native method) to
getAnnotations (a normal JVM method) gives the same time for both the
parallel and serial versions.
Cameron.
On Saturday, December 29, 2012 10:34:42 AM UTC+11, Leonardo Borges wrote:
In that case isn't context switching
I've been moving house for the last week or so but I'll also give the
benchmark another look.
My initial profiling seemed to show that the parallel version was spending
a significant amount of time in java.lang.Class.isArray;
clojush.pushstate/stack-ref is calling nth on the result of cons, since it
On Dec 21, 2012, at 6:59 PM, Meikel Brandmeyer wrote:
Is there a much simpler way that I overlooked?
I'm not sure it's simpler, but it's more straightforward, I'd say.
Thanks, Marshall and Meikel, for the struct-to-record conversion code. I'll
definitely make a change along those lines.
Wm. Josiah Erikson wmjos...@gmail.com writes:
I hope this helps people get to the bottom of things.
Not to the bottom of things yet, but found some low-hanging fruit –
switching the `push-state` from a struct-map to a record gives a flat
~2x speedup in all configurations I tested. So, that’s
On Dec 21, 2012, at 5:22 PM, Marshall Bockrath-Vandegrift wrote:
Not to the bottom of things yet, but found some low-hanging fruit –
switching the `push-state` from a struct-map to a record gives a flat
~2x speedup in all configurations I tested. So, that’s good?
I really appreciate your
Lee Spector lspec...@hampshire.edu writes:
FWIW I used records for push-states at one point but did not observe a
speedup and it required much messier code, so I reverted to
struct-maps. But maybe I wasn't doing the right timings. I'm curious
about how you changed to records without the
Hi,
Am 22.12.12 00:37, schrieb Lee Spector:
;; this is defined elsewhere, and I want push-states to have fields for each
push-type that's defined here
(def push-types '(:exec :integer :float :code :boolean :string :zip
:tag :auxiliary :return :environment))
(defn
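For concreteness, here is a hedged sketch of the struct-map-to-record change being discussed; the field list is trimmed and the names are mine, not the exact Clojush code:

```clojure
;; A record gives direct field access for its declared keys, which is
;; where the reported ~2x speedup over struct-maps came from.
(defrecord PushState [exec integer code string auxiliary])

(def empty-state (map->PushState {}))

(:integer (assoc empty-state :integer '(1 2)))  ; => (1 2)
```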
So here's what we came up with that clearly demonstrates the problem. Lee
provided the code and I tweaked it until I believe it shows the problem
clearly and succinctly.
I have put together a .tar.gz file that has everything needed to run it,
except lein. Grab it here:
Whoops, sorry about the link. It should be able to be found here:
http://gibson.hampshire.edu/~josiah/clojush/
On Wed, Dec 19, 2012 at 11:57 AM, Wm. Josiah Erikson wmjos...@gmail.com wrote:
So here's what we came up with that clearly demonstrates the problem. Lee
provided the code and I tweaked
On Dec 19, 2012, at 11:57 AM, Wm. Josiah Erikson wrote:
I think this is a succinct, deterministic benchmark that clearly
demonstrates the problem and also doesn't use conj or reverse.
Clarification: it's not just a tight loop involving reverse/conj, as our
previous benchmark was. It's our
I tried redefining the few places in the code (string_reverse, I think)
that used reverse to use the same version of reverse that I got such great
speedups with in your code, and it made no difference. There are not any
explicit calls to conj in the code that I could find.
On Wed, Dec 19, 2012 at
Wm. Josiah Erikson wmjos...@gmail.com writes:
Then run, for instance: /usr/bin/time -f %E lein run
clojush.examples.benchmark-bowling
and then, when that has finished, edit
src/clojush/examples/benchmark_bowling.clj and uncomment
:use-single-thread true and run it again. I think this is a
On Dec 15, 2012, at 1:14 AM, cameron wrote:
Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java
for my GP work but for the last few years it's been GEVA with a clojure
wrapper I wrote (https://github.com/cdorrat/geva-clj).
Ah yes -- I've actually downloaded and
On Dec 14, 2012, at 10:41 PM, cameron wrote:
Until Lee has a representative benchmark for his application it's difficult
to tell if he's
experiencing the same problem but there would seem to be a case for changing
the PersistentList
implementation in clojure.lang.
We put together a
I've created a test harness for this as a leiningen plugin:
https://github.com/bendlas/lein-partest
You can just put
:plugins [[net.bendlas/lein-partest 0.1.0]]
into your project and run
lein partest your.ns/testfn 6
to run 6 threads/processes in parallel
The plugin then runs the
Thanks Herwig,
I used your plugin with the following 2 burn variants:
(defn burn-slow [_]
  (count (last (take 1000 (iterate #(reduce conj '() %) (range 10000))))))
(defn burn-fast [_]
  (count (last (take 1000 (iterate #(reduce conj* (list nil) %) (range 10000))))))
Where conj* is just a
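The definition of conj* is cut off above; a plausible guess, consistent with the later discussion of keeping call sites monomorphic, is a conj that always returns a clojure.lang.Cons:

```clojure
;; Hypothetical conj*: always allocate a Cons cell, so the call site
;; only ever sees one concrete return type.
(defn conj* [coll x]
  (clojure.lang.Cons. x coll))

(reduce conj* nil (range 5))  ; => (4 3 2 1 0)
```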
I'd be interested in seeing your GP system. The one we're using evolves
Push programs and I suspect that whatever's triggering this problem with
multicore utilization is stemming from something in the inner loop of my
Push interpreter (https://github.com/lspector/Clojush)... but I don't
OK, I did something a little bit different, but I think it proves the same
thing we were shooting for.
On a 48-way 4 x Opteron 6168 with 32GB of RAM. This is Tom's Bowling
benchmark:
1. multithreaded. Average of 10 runs: 14:00.9
2. singlethreaded. Average of 10 runs: 23:35.3
3. singlethreaded, 8
Ah. We'll look into running several clojures in one JVM too. Thanks.
On Thu, Dec 13, 2012 at 1:41 PM, Wm. Josiah Erikson wmjos...@gmail.com wrote:
OK, I did something a little bit different, but I think it proves the same
thing we were shooting for.
On a 48-way 4 x Opteron 6168 with 32GB of
I'm not saying that I know this will help, but if you are open to trying a
different JVM that has had a lot of work done on it to optimize it for high
concurrency, Azul's Zing JVM may be worth a try, to see if it increases
parallelism for a single Clojure instance in a single JVM, with lots of
Cool. I've requested a free trial.
On Thu, Dec 13, 2012 at 1:53 PM, Andy Fingerhut andy.finger...@gmail.com wrote:
I'm not saying that I know this will help, but if you are open to trying a
different JVM that has had a lot of work done on it to optimize it for high
concurrency, Azul's Zing JVM
On Friday, December 14, 2012 5:41:59 AM UTC+11, Wm. Josiah Erikson wrote:
Does this help? Should I do something else as well? I'm curious to try
running like, say 16 concurrent copies on the 48-way node
Have you made any progress on a small deterministic benchmark that
reflects your
On Dec 13, 2012, at 4:21 PM, cameron wrote:
Have you made any progress on a small deterministic benchmark that reflects
your application's behaviour (i.e. the RNG seed work you were discussing)? I'm
keen to help, but I don't have time to look at benchmarks that take hours to
run.
I've
Hi Marshall,
the megamorphic call site hypothesis does sound plausible but I'm not
sure where the following test fits in.
If I understand correctly, we believe that it's the fact that the base case
(a PersistentList$EmptyList instance)
and the normal case (a PersistentList instance) have
Andy Fingerhut andy.finger...@gmail.com writes:
I'm not practiced in recognizing megamorphic call sites, so I could be
missing some in the example code below, modified from Lee's original
code. It doesn't use reverse or conj, and as far as I can tell
doesn't use PersistentList, either, only
cameron cdor...@gmail.com writes:
the megamorphic call site hypothesis does sound plausible but I'm
not sure where the following test fits in.
...
I was toying with the idea of replacing the EmptyList class with a
PersistentList instance to mitigate the problem
in at least one common
Lee:
I believe you said that your benchmarking code achieved good speedup when
run as separate JVMs that were each running a single thread, even before making
the changes to the implementation of reverse found by Marshall. I confirmed
that on my own machine as well.
Have you tried
On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote:
Have you tried running your real application in a single thread in a JVM, and
then run multiple JVMs in parallel, to see if there is any speedup? If so,
that would again help determine whether it is multiple threads in a single
JVM
Lee, while you are at benchmarking, would you mind running several threads
in one JVM with one clojure instance per thread? Thus each thread should
get JITted independently.
Christophe
On Wed, Dec 12, 2012 at 4:11 PM, Lee Spector lspec...@hampshire.edu wrote:
On Dec 12, 2012, at 10:03 AM,
On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote:
Lee, while you are at benchmarking, would you mind running several threads in
one JVM with one clojure instance per thread? Thus each thread should get
JITted independently.
I'm not actually sure how to do that. We're starting runs with
On Thursday, December 13, 2012 12:51:57 AM UTC+11, Marshall
Bockrath-Vandegrift wrote:
cameron cdo...@gmail.com javascript: writes:
the megamorphic call site hypothesis does sound plausible but I'm
not sure where the following test fits in.
...
I was toying with the idea of
See https://github.com/flatland/classlojure for a (nearly) ready-made
solution to running several Clojures in one JVM.
On Wed, Dec 12, 2012 at 5:20 PM, Lee Spector lspec...@hampshire.edu wrote:
On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote:
Lee, while you are at benchmarking, would
nicolas.o...@gmail.com nicolas.o...@gmail.com writes:
What happens if your run it a third time at the end? (The question
is related to the fact that there appears to be transition states
between monomorphic and megamorphic call sites, which might lead to
an explanation.)
Same results, but
On Dec 11, 2012, at 4:37 AM, Marshall Bockrath-Vandegrift wrote:
I’m not sure what the next steps are. Open a bug on the JVM? This is
something one can attempt to circumvent on a case-by-case basis, but
IMHO has significant negative implications for Clojure’s concurrency
story.
I've gotten
Lee,
My reading of this thread is not quite as pessimistic as yours. Here is
my synthesis for the practical application developer in Clojure from
reading and re-reading all of the posts above. Marshall and Cameron, please
feel free to correct me if I screw anything up here royally. ;-)
When
Lee Spector lspec...@hampshire.edu writes:
Is the following a fair characterization pending further developments?
If you have a cons-intensive task then even if it can be divided into
completely independent, long-running subtasks, there is currently no
known way to get significant speedups
On Dec 11, 2012, at 11:40 AM, Marshall Bockrath-Vandegrift wrote:
Or have I missed a currently-available work-around among the many
suggestions?
You can specialize your application to avoid megamorphic call sites in
tight loops. If you are working with `Cons`-order sequences, just use
Lee Spector lspec...@hampshire.edu writes:
If the application does lots of list processing but does so with a
mix of Clojure list and sequence manipulation functions, then one
would have to write private, list/cons-only versions of all of these
things? That is -- overstating it a bit, to be
OK WOW. You hit the nail on the head. It's reverse being called in a pmap
that does it. When I redefine my own version of reverse (I totally cheated
and just stole this) like this:
(defn reverse-recursively [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))
And, interestingly enough, suddenly the AMD FX-8350 beats the Intel Core i7
3770K, when before it was very very much not so. So for some reason, this
bug was tickled more dramatically on AMD multicore processors than on Intel
ones.
On Tue, Dec 11, 2012 at 2:54 PM, Wm. Josiah Erikson
Marshall:
I'm not practiced in recognizing megamorphic call sites, so I could be missing
some in the example code below, modified from Lee's original code. It doesn't
use reverse or conj, and as far as I can tell doesn't use PersistentList,
either, only Cons.
(defn burn-cons [size]
(let
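Andy's burn-cons is truncated above; the following is my reconstruction of the idea he describes (Cons instances only, no reverse, conj, or PersistentList), with the loop shape guessed:

```clojure
;; Build a list of `size` elements using only clojure.lang.Cons, then
;; count it; every allocation goes through a single concrete type.
(defn burn-cons [size]
  (loop [i 0
         value nil]
    (if (== i size)
      (count value)
      (recur (inc i) (clojure.lang.Cons. i value)))))

(burn-cons 5)  ; => 5
```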
...and, suddenly, the high-core-count Opterons show us what we wanted and
hoped for. If I increase that range statement to 100 and run it on the
48-core node, it takes 50 seconds (before it took 50 minutes), while the
FX-8350 takes 3:31.89 and the 3770K takes 3:48.95. Thanks Marshall! I think
you
Hm. Interesting. For the record, the exact code I'm running right now that
I'm seeing great parallelism with is this:
(defn reverse-recursively [coll]
(loop [[r & more :as all] (seq coll)
acc '()]
(if all
(recur more (cons r acc))
acc)))
(defn burn
([] (loop [i 0
On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote:
So I think if you replace your calls to `reverse` and any `conj` loops
you have in your own code, you should see a perfectly reasonable
speedup.
Tantalizing, but on investigation I see that our real application actually does
Hi Marshall,
I think we're definitely on the right track.
If I replace the reverse call with the following function I get a parallel
speedup of ~7.3 on an 8 core machine.
(defn copy-to-java-list [coll]
(let [lst (java.util.LinkedList.)]
(doseq [x coll]
(.addFirst lst x))
lst))
The main GC feature here is the Thread-Local Allocation Buffer (TLAB). TLABs are
on by default and are automatically sized according to allocation
patterns. The size can also be fine-tuned with the -XX:TLABSize=n configuration
option. You may consider tweaking this setting to optimize
runtime.
cameron cdor...@gmail.com writes:
There does seem to be something unusual about conj and
clojure.lang.PersistentList in this parallel test case and I don't
think it's related to the JVM's memory allocation.
I’ve got a few more data-points, but still no handle on what exactly is
going on.
My
- Parallel allocation of `Cons` and `PersistentList` instances through
a Clojure `conj` function remains fast as long as the function only
ever returns objects of a single concrete type
A possible explanation for this could be JIT Deoptimization. Deoptimization
happens when the
Aha. Not only do I get a lot of “made not entrant”, I get a lot of “made
zombie”. However, I get this for both runs with map and with pmap (and with
pmapall as well)
For instance, from a pmapall run:
33752 159 clojure.lang.Cons::next (10 bytes) made zombie
33752 164
I tried some more performance tuning options in Java, just for kicks, and
didn't get any advantages from them: -server -XX:+TieredCompilation
-XX:ReservedCodeCacheSize=256m
Also, in case it's informative:
[josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-1.txt
| wc -l
173
Wm. Josiah Erikson wmjos...@gmail.com writes:
Aha. Not only do I get a lot of made not entrant, I get a lot of
made zombie. However, I get this for both runs with map and with
pmap (and with pmapall as well)
I’m not sure this is all that enlightening. From what I can gather,
“made not
Interesting. I tried the following:
:jvm-opts ["-Xmx10g" "-Xms10g" "-XX:+AggressiveOpts" "-server"
"-XX:+TieredCompilation" "-XX:ReservedCodeCacheSize=256m" "-XX:TLABSize=1G"
"-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps" "-XX:+UseParNewGC"
"-XX:+ResizeTLAB" "-XX:+UseTLAB"]
I got a slight slowdown, and the GC details
Hi Lee,
Would it be difficult to try the following version of 'pmap'? It doesn't
use futures but executors instead so at least this could help narrow the
problem down... If the problem is due to the high number of futures
spawned by pmap then this should fix it...
(defn- with-thread-pool*
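The with-thread-pool* code is cut off above, so here is my own sketch of the executor-based idea (the names and pool sizing are mine, not the original code): run the work on a fixed thread pool instead of spawning a future per element.

```clojure
;; One fixed pool sized to the machine; invokeAll blocks until every
;; task has completed, then we collect the results in order.
(defn pmap-pool [f coll]
  (let [^java.util.concurrent.ExecutorService pool
        (java.util.concurrent.Executors/newFixedThreadPool
          (.availableProcessors (Runtime/getRuntime)))
        tasks (map (fn [x] (fn [] (f x))) coll)
        res   (mapv #(.get ^java.util.concurrent.Future %)
                    (.invokeAll pool tasks))]
    (.shutdown pool)
    res))

(pmap-pool inc (range 5))  ; => [1 2 3 4 5]
```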
cameron cdor...@gmail.com writes:
Interesting problem, the slowdown seems to be caused by the reverse
call (actually the calls to conj with a list argument).
Excellent analysis, sir! I think this points things in the right
direction.
fast-reverse : map-ms: 3.3, pmap-ms 0.7, speedup
If the number of object allocations mentioned earlier in this thread is real,
yes, VM heap management can be a bottleneck. There has to be some
locking done somewhere, otherwise the heap would get corrupted :)
The other bottleneck can come from garbage collection, which has to freeze
object allocation
On Dec 8, 2012, at 9:37 PM, Lee Spector wrote:
On Dec 8, 2012, at 10:19 PM, meteorfox wrote:
Now if you run vmstat 1 while running your benchmark you'll notice that the
run queue will be most of the time at 8, meaning that 8 processes are
waiting for CPU, and this is due to memory
On Dec 9, 2012, at 4:48 AM, Marshall Bockrath-Vandegrift wrote:
It’s like there’s a lock of some sort sneaking in on the `conj` path.
Any thoughts on what that could be?
My current best guess is the JVM's memory allocator, not Clojure code.
Andy
--
You received this message because you
Andy Fingerhut andy.finger...@gmail.com writes:
My current best guess is the JVM's memory allocator, not Clojure code.
I didn’t mean to imply the problem was in Clojure itself, but I don’t
believe the issue is in the memory allocator either. I now believe the
problem is in a class of JIT
On Dec 9, 2012, at 6:25 AM, Softaddicts wrote:
If the number of object allocations mentioned earlier in this thread is real,
yes, VM heap management can be a bottleneck. There has to be some
locking done somewhere, otherwise the heap would get corrupted :)
The other bottleneck can come from
There's no magic here, everyone tuning their app hit this wall eventually,
tweaking the JVM memory options :)
Luc
On Dec 9, 2012, at 6:25 AM, Softaddicts wrote:
If the number of object allocations mentioned earlier in this thread is real,
yes, VM heap management can be a bottleneck.
Even though this is very surprising (and sad) to hear, I'm afraid I've
got different experiences... My reducer-based parallel minimax is about
3x faster than the serial one, on my 4-core AMD phenom II and a tiny bit
faster on my girlfriend's intel i5 (2 physical cores + 2 virtual). I'm
Lee Spector lspec...@hampshire.edu writes:
I'm also aware that the test that produced the data I give below,
insofar as it uses pmap to do the distribution, may leave cores idle
for a bit if some tasks take a lot longer than others, because of the
way that pmap allocates cores to threads.
On Dec 7, 2012, at 9:42 PM, Andy Fingerhut wrote:
When you say we can run multiple instances of the test on the same machine,
do you mean that, for example, on an 8 core machine you run 8 different JVMs
in parallel, each doing a single-threaded 'map' in your Clojure code and not
a
On Dec 8, 2012, at 9:36 AM, Marshall Bockrath-Vandegrift wrote:
Although it doesn’t impact your benchmark, `pmap` may be further
adversely affecting the performance of your actual program. There’s an
open bug regarding `pmap` and chunked seqs:
http://dev.clojure.org/jira/browse/CLJ-862
My experiences in the past are similar to the numbers that Jim is reporting.
I have recently been centering most of my crunching code around reducers.
Is it possible for you to cook up a small representative test using
reducers+fork/join (and potentially primitives in the intermediate steps)?
On Dec 8, 2012, at 1:28 PM, Paul deGrandis wrote:
My experiences in the past are similar to the numbers that Jim is reporting.
I have recently been centering most of my crunching code around reducers.
Is it possible for you to cook up a small representative test using
reducers+fork/join
Andy: The short answer is yes, and we saw huge speedups. My latest post, as
well as Lee's, has details.
On Friday, December 7, 2012 9:42:03 PM UTC-5, Andy Fingerhut wrote:
On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:
Another strange observation is that we can run multiple instances
Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running
things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4
945, and found some very strange results, which I shall post here, after I
explain a little function that Lee wrote that is designed to get improved
I haven't analyzed your results in detail, but here are some results I had on
my 2GHz 4-core Intel core i7 MacBook Pro vintage 2011.
When running multiple threads within a single JVM invocation, I never got a
speedup of even 2. The highest speedup I measured was 1.82, when I ran 8
On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:
The test: I wrote a time-consuming function that just does a bunch of math
and list manipulation (which is what takes a lot of time in my real
applications):
(defn burn
([] (loop [i 0
value '()]
(if (= i 10000)
On Dec 8, 2012, at 3:42 PM, Andy Fingerhut wrote:
I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing
a linked list 10,000 times, each time allocating 10,000 cons cells (or
Clojure's equivalent of a cons cell)? For a total of around 100,000,000
memory allocations
I'm glad somebody else can duplicate our findings! I get results similar to
this on Intel hardware. On AMD hardware, the disparity is bigger, and
multiple threads of a single JVM invocation on AMD hardware consistently
gives me slowdowns as compared to a single thread. Also, your results are
Just tried, my first foray into reducers, but I must not be understanding
something correctly:
(time (r/map burn (doall (range 4))))
returns in less than a second on my macbook pro, whereas
(time (doall (map burn (range 4))))
takes nearly a minute.
This feels like unforced
Lee:
So I ran
On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:
I've been running compute intensive (multi-day), highly parallelizable
Clojure processes on high-core-count machines and blithely assuming that
since I saw near maximal CPU utilization in top and the like that I was
Lee:
I ran Linux perf and also watched the run queue (with vmstat) and your
bottleneck is basically memory access. The CPUs are idle 80% of the time by
stalled cycles. Here's what I got on my machine.
Intel Core i7 4 cores with Hyper thread (8 virtual processors)
16 GiB of Memory
Oracle JVM
Correction regarding the run queue: this is not completely correct. :S
But the stalled cycles and memory accesses still hold.
Sorry for the misinformation.
On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:
I've been running compute intensive (multi-day), highly parallelizable
On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote:
Yep, reducers don't use lazy seqs. They just return something like
transformed functions that will be applied when building the collection. So
you can use them like this:
(into [] (r/map burn (doall (range 4))))
See
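Marek's point can be shown with a tiny example (mine, not from the thread): r/map alone does no work; the elements are computed when the reducible is poured into a collection.

```clojure
(require '[clojure.core.reducers :as r])

;; r/map returns a reducible "recipe", not a realized sequence;
;; into [] is what actually runs the computation.
(def recipe (r/map inc (vec (range 5))))

(into [] recipe)  ; => [1 2 3 4 5]
```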
On Dec 8, 2012, at 10:19 PM, meteorfox wrote:
Now if you run vmstat 1 while running your benchmark you'll notice that the
run queue will be most of the time at 8, meaning that 8 processes are
waiting for CPU, and this is due to memory accesses (in this case, since this
is not true for
Interesting problem, the slowdown seems to be caused by the reverse call
(actually the calls to conj with a list argument).
Calling conj in a multi-threaded environment seems to have a significant
performance impact when using lists
I created some alternate reverse implementations (the
I forgot to mention: I cut the number of reverse iterations down to 1000
(not 10000) so I wouldn't have to wait too long for criterium. The speedup
numbers are representative of the full test though.
Cameron.
On Sunday, December 9, 2012 6:26:16 PM UTC+11, cameron wrote:
Interesting
Lee:
I'll just give a brief description right now, but one thing I've found in the
past on a 2-core machine that was achieving much less than 2x speedup was
memory bandwidth being the limiting factor.
Not all Clojure code allocates memory, but a lot does. If the hardware in a
system can
Thanks Andy.
My applications definitely allocate a lot of memory, which is reflected in all
of that consing in the test I was using. It'd be hard to do what we do in any
other way. I can see how a test using a Java mutable array would help to
diagnose the problem, but if that IS the problem