Hi,
I am currently testing SimilarityAnalysis.rowSimilarity and I am
wondering how I can increase the number of tasks used for the
distributed shuffle.
What I currently observe is that SimilarityAnalysis spends almost
20 minutes on my dataset in this single stage:
combineByKey at ABt.scala:126
When I view the details for that stage, I see that only one task is
spawned, running on a single node.
I have my own implementation of SimilarityAnalysis, and by tuning the
number of tasks I have achieved huge performance gains.
Since I couldn't find a way to pass the number of tasks to the shuffle
operations directly, I set the following in the Spark config:
configuration = new SparkConf().setAppName(jobConfig.jobName)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer.mb", "200")
  // SparkConf.set expects String values, so the count must be quoted;
  // this is supposed to raise default parallelism to some high number
  .set("spark.default.parallelism", "400")
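In case it helps to clarify what I have tried: besides setting spark.default.parallelism, I understand the Mahout DRM API exposes a par() operator that can force a minimum or exact number of partitions on the input matrix before the similarity computation runs. A minimal sketch of what I mean, assuming Mahout 0.10+ sparkbindings and that drmA is the input DRM (the partition count of 400 is just an illustrative value):

```scala
import org.apache.mahout.math.drm._
import org.apache.mahout.math.cf.SimilarityAnalysis

// Repartition the input DRM before rowSimilarity so that the
// downstream shuffle stages (e.g. the combineByKey in ABt) have
// more tasks to work with. par(exact = n) is the DRM operator
// for pinning the partition count; 400 is an assumed value.
val drmRepartitioned = drmA.par(exact = 400)
val similarities = SimilarityAnalysis.rowSimilarity(drmRepartitioned)
```

I am not sure whether the repartitioning survives into the ABt stage, which is why I am asking how parallelism is meant to be controlled here.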
Thank you for your help
reinis