Hm, yeah, the docs are not clear on this one. The function you're looking for to change the number of partitions on any ol' RDD is "repartition()", which is available in master but for some reason doesn't seem to show up in the latest docs. Sorry about that. I also hadn't realized from the Python docs that partitionBy() had this behavior, though it is consistent with the Scala API, which just makes it more obvious through the types.
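For intuition, here's a plain-Python sketch (no Spark needed) of why partitionBy() wants (key, value) pairs: a hash partitioner assigns each element to a partition by hashing the element's *key*, so there has to be a key to unpack and hash. The built-in hash() below stands in for PySpark's actual partitioning hash; the function name is just for illustration.

```python
def partition_by_key(pairs, num_partitions):
    """Bucket (key, value) pairs into num_partitions lists by hash(key).

    Feeding this bare values like [1, 2, 3] fails at the tuple
    unpacking step, analogous to partitionBy() on a non-pair RDD.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:  # requires 2-tuples, like a pair RDD
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Mimics b = a.map(lambda x: (x, x)).partitionBy(5) from the thread.
data = [(x, x) for x in range(1, 11)]
parts = partition_by_key(data, 5)
```

repartition(), by contrast, just reshuffles elements to a new partition count without hashing keys, which is why it works on any RDD.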
On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Just an FYI, it's not obvious from the docs
> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
> that the following code should fail:
>
>     a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>     a._jrdd.splits().size()
>     a.count()
>     b = a.partitionBy(5)
>     b._jrdd.splits().size()
>     b.count()
>
> I figured out from the example that if I generated a key by doing this
>
>     b = a.map(lambda x: (x, x)).partitionBy(5)
>
> then all would be well.
>
> In other words, partitionBy() only works on RDDs of tuples. Is that correct?
>
> Nick
>
> ------------------------------
> View this message in context: PySpark RDD.partitionBy() requires an RDD of tuples
> <http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.