Hi,
Do you think that simply increasing this parameter is a safe and sane thing
to do?
Why would it be unsafe?
In my own implementation I am using 400 tasks on my 4-node, 2-CPU cluster,
and the execution time of the largest shuffle stage has dropped by roughly
a factor of 10.
I have a number of test values saved from the time when I used the "old"
RowSimilarityJob, and with some exceptions (I guess due to the randomized
sparsification) I still get approximately the same values with my own row
similarity implementation.
reinis
On 13.10.2014 18:06, Ted Dunning wrote:
On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <[email protected]> wrote:
I have my own implementation of SimilarityAnalysis, and by tuning the number
of tasks I have achieved HUGE performance gains.
Since I couldn't find how to pass the number of tasks to the shuffle
operations directly, I have set the following in the Spark config:
configuration = new SparkConf().setAppName(jobConfig.jobName)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer.mb", "200")
  .set("spark.default.parallelism", "400") // <- this is the line that is supposed to set the default parallelism to some high number
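For comparison, on a plain pair RDD the shuffle operations themselves take an
explicit partition count, so the number of tasks can be passed per operation
instead of via spark.default.parallelism. A minimal sketch, assuming an
existing SparkContext named sc and made-up data (this is outside the Mahout
SimilarityAnalysis code path):

import org.apache.spark.SparkContext._  // implicit pair-RDD operations (reduceByKey, join, ...)
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext named sc; the data below is illustrative only.
val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// The second argument sets the number of reduce tasks (shuffle partitions) for this stage.
val summed = pairs.reduceByKey(_ + _, 400)

// repartition(n) forces a specific partition count on the result if needed downstream.
val repartitioned = summed.repartition(400)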
Thank you for your help
Thank you for YOUR help!
Do you think that simply increasing this parameter is a safe and sane thing
to do?