Of course the number of partitions/tasks should be configurable; I am
just saying that in my experiments I have observed a close-to-linear
performance increase simply by increasing the number of partitions/tasks
(which was absolutely not the case with MapReduce).
I am assuming that Spark is not "smart" enough to pick optimal values for
the parallelism on its own. I recall reading somewhere that the default is
the number of CPU cores or 2, whichever is larger. Because of the nature of
the tasks (if I am not mistaken, they are wrapped Akka actors) it is
possible to execute a much higher number of tasks per CPU efficiently. The
tuning guide suggests 2-3 tasks per CPU core
(http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism), but
I have sometimes observed considerable performance gains when increasing the
number of tasks to 8 or even 16 per CPU core.
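To make that concrete, here is a minimal sketch in plain Spark/Scala (not
Mahout code); the input path and the factor of 8 are placeholders standing
in for the empirical numbers above:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("parallelism-probe")
    val sc = new SparkContext(conf)

    // defaultParallelism is what Spark picked on its own (roughly total cores, min 2)
    println(s"default parallelism in effect: ${sc.defaultParallelism}")

    // widen an input RDD to roughly 8 tasks per core, the empirical sweet spot above
    val data = sc.textFile("hdfs:///some/input")               // placeholder path
    val widened = data.repartition(sc.defaultParallelism * 8)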
On 13.10.2014 18:53, Pat Ferrel wrote:
There is a possibility that we are doing something with partitioning that
interferes, but I think Ted’s point is that Spark should do the right thing in
most cases, unless we interfere. Those values are meant for tuning to the exact
job you are running, so it may not be appropriate for us to hard-code them. We
could allow the CLI to set them, like we do with -sem, if needed.
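If we went that route, here is a rough sketch of what the driver side could
look like; the flag name --default-parallelism is purely hypothetical, not an
existing Mahout option, and the point is only to forward a CLI value into
SparkConf before the context is built:

    import org.apache.spark.SparkConf

    object DriverSketch {
      def main(args: Array[String]): Unit = {
        // hypothetical flag, parsed naively here just for illustration
        val parallelism = args.sliding(2).collectFirst {
          case Array("--default-parallelism", v) => v
        }.getOrElse("8") // fall back to something small if the flag is absent

        val conf = new SparkConf()
          .setAppName("itemsimilarity-sketch")
          .set("spark.default.parallelism", parallelism)
        // ... build the SparkContext / DistributedContext from conf as the driver already does
      }
    }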
Let’s see what Dmitriy thinks about why only one task is being created.
On Oct 13, 2014, at 9:32 AM, Reinis Vicups <[email protected]> wrote:
Hi,
Do you think that simply increasing this parameter is a safe and sane thing
to do?
Why would it be unsafe?
In my own implementation I am using 400 tasks on my 4-node, 2-CPU cluster, and
the execution time of the largest shuffle stage has dropped by roughly a factor
of 10.
I have a number of test values from back when I used the "old"
RowSimilarityJob, and with some exceptions (I guess due to randomized
sparsification) I still get approximately the same values with my own
row-similarity implementation.
reinis
On 13.10.2014 18:06, Ted Dunning wrote:
On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <[email protected]> wrote:
I have my own implementation of SimilarityAnalysis, and by tuning the number
of tasks I have achieved HUGE performance gains.
Since I couldn't find how to pass the number of tasks to shuffle
operations directly, I have set the following in the Spark config:

    configuration = new SparkConf().setAppName(jobConfig.jobName)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
      .set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
      .set("spark.default.parallelism", "400") // <- this line is supposed to set default parallelism to a high number
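For contrast, the plain Spark RDD API does let you pass the partition count to
individual shuffle operations; here is a minimal sketch (the key/value types
and the count of 400 are just illustrative):

    import org.apache.spark.rdd.RDD

    // given some keyed RDD produced earlier in the job
    def shuffleWithExplicitTasks(pairs: RDD[(Int, Double)]): RDD[(Int, Double)] =
      // reduceByKey (like groupByKey, join, etc.) takes an optional numPartitions
      // argument that sets the number of reduce tasks for this shuffle only
      pairs.reduceByKey((a, b) => a + b, 400)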
Thank you for your help
Thank you for YOUR help!
Do you think that simply increasing this parameter is a safe and sane thing
to do?