There is a possibility that we are doing something with partitioning that interferes, but I think Ted’s point is that Spark should do the right thing in most cases unless we interfere. Those values are meant for tuning to the exact job you are running, so it may not be appropriate for us to hard-code them. We could allow the CLI to set them, as we do with -sem, if needed.
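For illustration only, here is a rough sketch in Scala of what that could look like. The option name and the Option[Int] plumbing are hypothetical and not an existing Mahout driver flag; only the SparkConf calls are real Spark API.

    import org.apache.spark.SparkConf

    // Hypothetical helper: apply a user-supplied parallelism (e.g. from a CLI
    // option analogous to -sem) only when it was given explicitly; otherwise
    // leave Spark's own defaults untouched.
    def buildConf(appName: String, defaultParallelism: Option[Int]): SparkConf = {
      val conf = new SparkConf().setAppName(appName)
      defaultParallelism.foreach(p => conf.set("spark.default.parallelism", p.toString))
      conf
    }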
Let’s see what Dmitriy thinks about why only one task is being created.

On Oct 13, 2014, at 9:32 AM, Reinis Vicups <[email protected]> wrote:

Hi,

> Do you think that simply increasing this parameter is a safe and sane thing
> to do?

Why would it be unsafe? In my own implementation I am using 400 tasks on my
4-node, 2-CPU cluster, and the execution times of the largest shuffle stage
have dropped roughly tenfold. I have a number of test values from back when I
used the "old" RowSimilarityJob and, with some exceptions (I guess due to the
randomized sparsification), I still get approximately the same values with my
own row similarity implementation.

reinis

On 13.10.2014 18:06, Ted Dunning wrote:
> On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups <[email protected]> wrote:
>
>> I have my own implementation of SimilarityAnalysis, and by tuning the
>> number of tasks I have achieved HUGE performance gains.
>>
>> Since I couldn't find how to pass the number of tasks to shuffle
>> operations directly, I have set the following in the Spark config:
>>
>> configuration = new SparkConf().setAppName(jobConfig.jobName)
>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
>>   .set("spark.kryo.referenceTracking", "false")
>>   .set("spark.kryoserializer.buffer.mb", "200")
>>   .set("spark.default.parallelism", 400) // <- this line is supposed to set default parallelism to some high number
>>
>> Thank you for your help
>>
> Thank you for YOUR help!
>
> Do you think that simply increasing this parameter is a safe and sane thing
> to do?
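For reference, a cleaned-up sketch of the configuration quoted above. Note that SparkConf.set takes string values, so the parallelism has to be passed as "400" rather than the integer 400; the app name below is a placeholder, since jobConfig.jobName in the quoted snippet comes from the poster's own code.

    import org.apache.spark.SparkConf

    // Same settings as in the quoted mail, with line continuations rejoined
    // and the parallelism given as a string so it compiles.
    val configuration = new SparkConf()
      .setAppName("SimilarityAnalysis") // placeholder; the quoted code used jobConfig.jobName
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
      .set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
      .set("spark.default.parallelism", "400") // default number of partitions used by shuffle operations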
