“Using Spark 0.8.1 … Java code running on a single node with 8 CPUs and 16 GB RAM”

Local or standalone (single node)?

From: [email protected]
Sent: January 14, 2014, 13:42
To: user
Subject: Re: question on using spark parallelism vs using num partitions in spark api

I think the parallelism parameter just controls how many tasks can run together on each worker; it can't control how many tasks the job is split into.

[email protected]

________________________________
From: [email protected]
Date: 2014-01-14 09:17
To: [email protected]
Subject: question on using spark parallelism vs using num partitions in spark api

Hi,

I am using Spark 0.8.1, with Java code running on a single node with 8 CPUs and 16 GB RAM.

It looks like setting spark.default.parallelism with System.setProperty("spark.default.parallelism", "24") before creating my SparkContext, as described in http://spark.incubator.apache.org/docs/latest/tuning.html#level-of-parallelism, has no effect on the default number of partitions Spark uses in APIs like saveAsTextFile(). For example, with spark.default.parallelism set to 24 I expected 24 tasks to be invoked when calling saveAsTextFile(), but only 1 task is invoked.

If instead I give parallelize() an explicit partition count, as in

    dataSetRDD = SparkDriver.getSparkContext().parallelize(mydata, 2);

and then invoke dataSetRDD.saveAsTextFile(JavaRddFilePath), I see 2 tasks invoked, even though spark.default.parallelism was set to 24.

Can someone explain the above behavior?

Thanks,
Hussam
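For reference, here is a minimal, self-contained Java sketch of the scenario above. The "local[8]" master URL, the sample data, and the output paths are illustrative assumptions rather than details from the thread (the original code used a SparkDriver helper that is not shown). One plausible explanation for the single task, in line with the "local or standalone?" question, is that the 0.8.x local scheduler derives its default parallelism from the thread count in the master URL, so with a plain "local" master (one thread) spark.default.parallelism would not take effect.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ParallelismDemo {
        public static void main(String[] args) {
            // Must be set before the context is created; note that the
            // value has to be passed as a String, not an int.
            System.setProperty("spark.default.parallelism", "24");

            // "local[8]" runs 8 worker threads; a plain "local" master
            // runs only one, so only one task executes at a time.
            JavaSparkContext sc =
                new JavaSparkContext("local[8]", "ParallelismDemo");

            List<Integer> mydata = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

            // An explicit slice count pins this RDD to 2 partitions, so
            // saveAsTextFile() launches exactly 2 tasks here regardless
            // of spark.default.parallelism.
            JavaRDD<Integer> dataSetRDD = sc.parallelize(mydata, 2);
            dataSetRDD.saveAsTextFile("/tmp/out-two-partitions");

            // With no explicit count, parallelize() falls back to the
            // scheduler's default parallelism for its partition count.
            JavaRDD<Integer> defaultRDD = sc.parallelize(mydata);
            defaultRDD.saveAsTextFile("/tmp/out-default");

            sc.stop();
        }
    }

Whichever master is used, saveAsTextFile() launches one task per partition of the RDD it is called on, so an explicit slice count passed to parallelize() always overrides spark.default.parallelism for that RDD.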
