Hi,
I am currently testing SimilarityAnalysis.rowSimilarity and I am
wondering how I can increase the number of tasks used for the
distributed shuffle.
What I currently observe is that SimilarityAnalysis spends almost
20 minutes on my dataset in this single stage:
combineByKey at ABt.scala:126
When I view the details for that stage, I see that only one task is
spawned, running on a single node.
I have my own implementation of SimilarityAnalysis, and by tuning the
number of tasks I have achieved huge performance gains.
Since I couldn't find a way to pass the number of tasks to the shuffle
operations directly, I set the following in the Spark config:
configuration = new SparkConf().setAppName(jobConfig.jobName)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer.mb", "200")
  // SparkConf.set expects String values, so the count must be quoted;
  // this is supposed to raise default parallelism to some high number
  .set("spark.default.parallelism", "400")
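In case it helps to clarify what I have tried: besides setting spark.default.parallelism, I understand the Mahout DRM API exposes a par() operator that can force a minimum or exact number of partitions on the input matrix before the similarity computation runs. A minimal sketch of what I mean, assuming Mahout 0.10+ sparkbindings and that drmA is the input DRM (the partition count of 400 is just an illustrative value):

```scala
import org.apache.mahout.math.drm._
import org.apache.mahout.math.cf.SimilarityAnalysis

// Repartition the input DRM before rowSimilarity so that the
// downstream shuffle stages (e.g. the combineByKey in ABt) have
// more tasks to work with. par(exact = n) is the DRM operator
// for pinning the partition count; 400 is an assumed value.
val drmRepartitioned = drmA.par(exact = 400)
val similarities = SimilarityAnalysis.rowSimilarity(drmRepartitioned)
```

I am not sure whether the repartitioning survives into the ABt stage, which is why I am asking how parallelism is meant to be controlled here.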
Thank you for your help
reinis