Of course the number of partitions/tasks should be configurable; I am
just saying that in my experiments I have observed a close-to-linear
performance increase simply by increasing the number of partitions/tasks
(which was absolutely not the case with MapReduce).
I am assuming that Spark is not "smart" enough to pick optimal values for
the parallelism on its own. I recall reading somewhere that the default is
the number of CPU cores or 2, whichever is larger. Because of the nature of
the tasks (if I am not mistaken, they are wrapped Akka actors) it is
possible to execute a much higher number of tasks per CPU efficiently. The
tuning guide suggests 2-3 tasks per CPU core
(http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism), but
I have sometimes observed considerable performance gains when increasing the
number of tasks to 8 or even 16 per CPU core.
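To make that concrete, here is a minimal sketch in plain Spark/Scala (not
Mahout code); the input path and the factor of 8 are placeholders standing
in for the empirical numbers above:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("parallelism-probe")
    val sc = new SparkContext(conf)

    // defaultParallelism is what Spark picked on its own (roughly total cores, min 2)
    println(s"default parallelism in effect: ${sc.defaultParallelism}")

    // widen an input RDD to roughly 8 tasks per core, the empirical sweet spot above
    val data = sc.textFile("hdfs:///some/input")               // placeholder path
    val widened = data.repartition(sc.defaultParallelism * 8)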
On 13.10.2014 18:53, Pat Ferrel wrote:
There is a possibility that we are doing something with partitioning that
interferes, but I think Ted’s point is that Spark should do the right thing in
most cases, unless we interfere. Those values are meant for tuning to the exact
job you are running, so it may not be appropriate for us to hard-code them. We
could allow the CLI to set them, like we do with -sem, if needed.
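If we went that route, here is a rough sketch of what the driver side could
look like; the flag name --default-parallelism is purely hypothetical, not an
existing Mahout option, and the point is only to forward a CLI value into
SparkConf before the context is built:

    import org.apache.spark.SparkConf

    object DriverSketch {
      def main(args: Array[String]): Unit = {
        // hypothetical flag, parsed naively here just for illustration
        val parallelism = args.sliding(2).collectFirst {
          case Array("--default-parallelism", v) => v
        }.getOrElse("8") // fall back to something small if the flag is absent

        val conf = new SparkConf()
          .setAppName("itemsimilarity-sketch")
          .set("spark.default.parallelism", parallelism)
        // ... build the SparkContext / DistributedContext from conf as the driver already does
      }
    }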
Let’s see what Dmitriy thinks about why only one task is being created.
On Oct 13, 2014, at 9:32 AM, Reinis Vicups <[email protected]> wrote:
Hi,
Do you think that simply increasing this parameter is a safe and sane thing
to do?
Why would it be unsafe?
In my own implementation I am using 400 tasks on my 4-node, 2-CPU cluster, and
the execution time of the largest shuffle stage has dropped by roughly a factor
of 10.
I have a number of test values from back when I used the "old"
RowSimilarityJob, and with some exceptions (I guess due to randomized
sparsification) I still get approximately the same values with my own
row-similarity implementation.
reinis
On 13.10.2014 18:06, Ted Dunning wrote:
On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <[email protected]> wrote:
I have my own implementation of SimilarityAnalysis, and by tuning the number
of tasks I have achieved HUGE performance gains.
Since I couldn't find how to pass the number of tasks to shuffle
operations directly, I have set the following in the Spark config:

    configuration = new SparkConf().setAppName(jobConfig.jobName)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
      .set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
      .set("spark.default.parallelism", "400") // <- this line is supposed to set default parallelism to a high number
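For contrast, the plain Spark RDD API does let you pass the partition count to
individual shuffle operations; here is a minimal sketch (the key/value types
and the count of 400 are just illustrative):

    import org.apache.spark.rdd.RDD

    // given some keyed RDD produced earlier in the job
    def shuffleWithExplicitTasks(pairs: RDD[(Int, Double)]): RDD[(Int, Double)] =
      // reduceByKey (like groupByKey, join, etc.) takes an optional numPartitions
      // argument that sets the number of reduce tasks for this shuffle only
      pairs.reduceByKey((a, b) => a + b, 400)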
Thank you for your help
Thank you for YOUR help!
Do you think that simply increasing this parameter is a safe and sane thing
to do?