Hi,
Do you think that simply increasing this parameter is a safe and sane thing
to do?
Why would it be unsafe?
In my own implementation I am using 400 tasks on my 4-node, 2-CPU cluster,
and the execution time of the largest shuffle stage has dropped by roughly
a factor of 10.
I have a number of test values saved from the time when I used the "old"
RowSimilarityJob, and with some exceptions (I guess due to the randomized
sparsification) I still get approximately the same values with my own row
similarity implementation.
reinis
On 13.10.2014 18:06, Ted Dunning wrote:
On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <[email protected]> wrote:
I have my own implementation of SimilarityAnalysis, and by tuning the number
of tasks I have achieved HUGE performance gains.
Since I couldn't find how to pass the number of tasks to the shuffle
operations directly, I have set the following in the Spark config:
configuration = new SparkConf().setAppName(jobConfig.jobName)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer.mb", "200")
  .set("spark.default.parallelism", "400") // <- this is the line that is supposed to set the default parallelism to some high number
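For comparison, on a plain pair RDD the shuffle operations themselves take an
explicit partition count, so the number of tasks can be passed per operation
instead of via spark.default.parallelism. A minimal sketch, assuming an
existing SparkContext named sc and made-up data (this is outside the Mahout
SimilarityAnalysis code path):

import org.apache.spark.SparkContext._  // implicit pair-RDD operations (reduceByKey, join, ...)
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext named sc; the data below is illustrative only.
val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// The second argument sets the number of reduce tasks (shuffle partitions) for this stage.
val summed = pairs.reduceByKey(_ + _, 400)

// repartition(n) forces a specific partition count on the result if needed downstream.
val repartitioned = summed.repartition(400)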
Thank you for your help
Thank you for YOUR help!
Do you think that simply increasing this parameter is a safe and sane thing
to do?