import numpy as np def id(x): return x
rdd = sc.parallelize(np.arange(1000)) rdd = rdd.map(lambda x: (x,1)) rdd = rdd.partitionBy(8, id) rdd = rdd.cache().setName('milestone') rdd.join(rdd).collect() The above code generates this DAG: <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27429/Screenshot_from_2016-07-29_20-48-56.png> Zoom in Stage 13: <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27429/Screenshot_from_2016-07-29_20-54-21.png> Zoom in Stage 14: <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27429/Screenshot_from_2016-07-29_20-55-50.png> The green box is cached 'milestone'. Normally, it should contain partition information. However, there is still shuffling in `join()`. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-1-6-1-partitionBy-does-not-provide-meaningful-information-for-join-to-use-tp27429.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org