Re: number of partitions in join: Spark documentation misleading!

2015-06-16 Thread Davies Liu
Please file a JIRA for it.

On Mon, Jun 15, 2015 at 8:00 AM, mrm ma...@skimlinks.com wrote:
 Hi all,

 I was looking for an explanation of the number of partitions of a joined
 RDD.

 The documentation of Spark 1.3.1 says (describing the default value of
 spark.default.parallelism): "For distributed shuffle operations like
 reduceByKey and join, the largest number of partitions in a parent RDD."
 https://spark.apache.org/docs/latest/configuration.html

 And the Partitioner.scala comments (line 51) state that:
 Unless spark.default.parallelism is set, the number of partitions will be
 the same as the number of partitions in the largest upstream RDD, as this
 should be least likely to cause out-of-memory errors.

 But this is misleading for the Python API: if you do rddA.join(rddB), the
 number of partitions of the output is the number of partitions of rddA plus
 the number of partitions of rddB!
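
In the meantime, a possible workaround (just a sketch, with made-up partition
counts of 4 and 6 and a local master) is to pass numPartitions to join
explicitly, or to set spark.default.parallelism before the SparkContext is
created:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[4]").setAppName("join-partitions")
    # Setting spark.default.parallelism here would also change the default:
    # conf.set("spark.default.parallelism", "6")
    sc = SparkContext(conf=conf)

    rddA = sc.parallelize([(i, i) for i in range(100)], 4)
    rddB = sc.parallelize([(i, -i) for i in range(100)], 6)

    # Forcing the partition count sidesteps the default behaviour entirely.
    joined = rddA.join(rddB, numPartitions=rddB.getNumPartitions())
    print(joined.getNumPartitions())  # 6 rather than 4 + 6 = 10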







number of partitions in join: Spark documentation misleading!

2015-06-15 Thread mrm
Hi all,

I was looking for an explanation of the number of partitions of a joined
RDD.

The documentation of Spark 1.3.1 says (describing the default value of
spark.default.parallelism): "For distributed shuffle operations like
reduceByKey and join, the largest number of partitions in a parent RDD."
https://spark.apache.org/docs/latest/configuration.html

And the Partitioner.scala comments (line 51) state that:
Unless spark.default.parallelism is set, the number of partitions will be
the same as the number of partitions in the largest upstream RDD, as this
should be least likely to cause out-of-memory errors.

But this is misleading for the Python API: if you do rddA.join(rddB), the
number of partitions of the output is the number of partitions of rddA plus
the number of partitions of rddB!
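
For concreteness, a minimal sketch of what I am seeing (the partition counts
4 and 6 and the local master are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "join-partitions")

    # Two pair RDDs with different, explicit numbers of partitions.
    rddA = sc.parallelize([(i, i) for i in range(100)], 4)
    rddB = sc.parallelize([(i, -i) for i in range(100)], 6)

    joined = rddA.join(rddB)

    # The documentation suggests 6 (the largest parent); I observe 10 (4 + 6).
    print(joined.getNumPartitions())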



