Just an FYI, it's not obvious from the docs (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy) that the following code should fail:
    a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)
    a._jrdd.splits().size()
    a.count()
    b = a.partitionBy(5)
    b._jrdd.splits().size()
    b.count()

I figured out from the example that if I generate a key first, like this:

    b = a.map(lambda x: (x, x)).partitionBy(5)

then all is well. In other words, partitionBy() only works on RDDs of tuples, i.e. key-value pairs. Is that correct?
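For what it's worth, here is the workaround written out end to end. This is a minimal sketch, assuming a live SparkContext named sc; it counts partitions with glom() instead of poking at the private _jrdd handle:

    a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

    # partitionBy() hash-partitions by key, so it needs (key, value)
    # pairs; keying each element by itself satisfies that
    b = a.map(lambda x: (x, x)).partitionBy(5)

    # glom() turns each partition into a list, so the length of the
    # collected result is the partition count
    print(len(b.glom().collect()))   # 5

    # strip the synthetic keys afterwards if only the values matter
    print(b.values().count())        # 10

keyBy() looks like it would do the same keying step in one call, though I haven't tried it together with partitionBy().

Nick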