There is a possibility that we are doing something with partitioning that interferes, but I think Ted’s point is that Spark should do the right thing in most cases unless we interfere. Those values are meant for tuning to the exact job you are running, so it may not be appropriate for us to hard-code them. We could allow the CLI to set them, as we do with -sem, if needed.
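For illustration only, here is a rough sketch in Scala of what that could look like. The option name and the Option[Int] plumbing are hypothetical and not an existing Mahout driver flag; only the SparkConf calls are real Spark API.

    import org.apache.spark.SparkConf

    // Hypothetical helper: apply a user-supplied parallelism (e.g. from a CLI
    // option analogous to -sem) only when it was given explicitly; otherwise
    // leave Spark's own defaults untouched.
    def buildConf(appName: String, defaultParallelism: Option[Int]): SparkConf = {
      val conf = new SparkConf().setAppName(appName)
      defaultParallelism.foreach(p => conf.set("spark.default.parallelism", p.toString))
      conf
    }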
Let’s see what Dmitriy thinks about why only one task is being created.

On Oct 13, 2014, at 9:32 AM, Reinis Vicups <[email protected]> wrote:

Hi,

> Do you think that simply increasing this parameter is a safe and sane thing
> to do?

Why would it be unsafe? In my own implementation I am using 400 tasks on my
4-node, 2-CPU cluster, and the execution times of the largest shuffle stage
have dropped roughly tenfold. I have a number of test values from back when I
used the "old" RowSimilarityJob and, with some exceptions (I guess due to the
randomized sparsification), I still get approximately the same values with my
own row similarity implementation.

reinis

On 13.10.2014 18:06, Ted Dunning wrote:
> On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <[email protected]> wrote:
>
>> I have my own implementation of SimilarityAnalysis, and by tuning the
>> number of tasks I have achieved HUGE performance gains.
>>
>> Since I couldn't find how to pass the number of tasks to shuffle
>> operations directly, I have set the following in the Spark config:
>>
>> configuration = new SparkConf().setAppName(jobConfig.jobName)
>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
>>   .set("spark.kryo.referenceTracking", "false")
>>   .set("spark.kryoserializer.buffer.mb", "200")
>>   .set("spark.default.parallelism", 400) // <- this line is supposed to set default parallelism to some high number
>>
>> Thank you for your help
>>
> Thank you for YOUR help!
>
> Do you think that simply increasing this parameter is a safe and sane thing
> to do?
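For reference, a cleaned-up sketch of the configuration quoted above. Note that SparkConf.set takes string values, so the parallelism has to be passed as "400" rather than the integer 400; the app name below is a placeholder, since jobConfig.jobName in the quoted snippet comes from the poster's own code.

    import org.apache.spark.SparkConf

    // Same settings as in the quoted mail, with line continuations rejoined
    // and the parallelism given as a string so it compiles.
    val configuration = new SparkConf()
      .setAppName("SimilarityAnalysis") // placeholder; the quoted code used jobConfig.jobName
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
      .set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
      .set("spark.default.parallelism", "400") // default number of partitions used by shuffle operations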
