Hi,

I am currently testing SimilarityAnalysis.rowSimilarity and I am wondering how I can increase the number of tasks used for the distributed shuffle.

What I currently observe is that, for my dataset, SimilarityAnalysis spends almost 20 minutes in this stage alone:

combineByKey at ABt.scala:126

When I view the details for the stage, I see that only one task is spawned, running on a single node.

I have my own implementation of SimilarityAnalysis, and by tuning the number of tasks I have achieved HUGE performance gains, roughly as sketched below.
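For illustration, this is approximately what I do in my own code; the names and types here are mine, not Mahout's, and I am simply using the overload of Spark's combineByKey that takes an explicit number of partitions:

    import org.apache.spark.rdd.RDD

    // Sketch from my own row-similarity code: counting observations per key
    // with an explicit number of shuffle tasks. `pairs` is an illustrative
    // RDD[(Int, Double)] of (row, value) observations; numTasks controls how
    // many partitions (and therefore tasks) the shuffle produces.
    def countPerRow(pairs: RDD[(Int, Double)], numTasks: Int): RDD[(Int, Long)] =
      pairs.combineByKey[Long](
        (_: Double) => 1L,                  // createCombiner: first value seen
        (c: Long, _: Double) => c + 1L,     // mergeValue: fold in another value
        (c1: Long, c2: Long) => c1 + c2,    // mergeCombiners: merge partitions
        numTasks                            // number of shuffle partitions
      )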

Since I couldn't find a way to pass the number of tasks to the shuffle operations directly, I have set the following in the Spark config:

configuration = new SparkConf().setAppName(jobConfig.jobName)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer.mb", "200")
  .set("spark.default.parallelism", "400") // <- this line is supposed to set the default parallelism to a high number

Thank you for your help
reinis
