From the Spark tuning guide <http://spark.apache.org/docs/latest/tuning.html>:

In general, we recommend 2-3 tasks per CPU core in your cluster.


I think Spark will only run one task per partition of a given RDD at any one
time. So if your RDD has 10 partitions, then at most 10 tasks can operate on
it concurrently (assuming you have at least 10 cores available). At the same
time, you want each task to run quickly on a small slice of data; the smaller
the slice, the more likely the task working on it is to finish quickly and
successfully. So it's good to have more, smaller partitions.
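
As a rough sketch of how to check this from the Scala shell (assuming sc is
your SparkContext, and the input path here is just a made-up example):

    val rdd = sc.textFile("hdfs:///data/example.txt")

    // Number of partitions = upper bound on how many tasks can work on
    // this RDD concurrently within a single stage.
    println(rdd.partitions.size)

    // Default parallelism Spark uses when it has to pick a partition
    // count itself -- on most cluster managers this reflects the total
    // number of cores across your executors.
    println(sc.defaultParallelism)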

For example, if you have 10 cores and an RDD with 20 partitions, you should
see the work happen in 2 waves, with 10 tasks operating on the RDD
concurrently in each wave.

Hence, I interpret that line from the tuning guide to mean it's best to
have your RDD partitioned into (numCores * 2) or (numCores * 3) partitions,
where numCores is the total number of cores across the cluster, not the
number of nodes.
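
A hedged sketch of that rule of thumb (the variable names are just
placeholders, and repartition triggers a shuffle, so you'd normally do this
once up front or pass a partition count to the input API instead):

    val numCores = 10                          // total cores across the cluster
    val tuned = rdd.repartition(numCores * 3)  // aim for 2-3 tasks per core

    // Many input methods also accept a partition hint directly, e.g.:
    val lines = sc.textFile("hdfs:///data/example.txt", numCores * 3)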

Nick


On Wed, Apr 16, 2014 at 7:50 PM, Joe L <selme...@yahoo.com> wrote:

> Is it true that it is better to choose the number of partition according to
> the number of nodes in the cluster?
>
> partitionBy(numNodes)
>
