Just an FYI, it's not obvious from the docs (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy) that the following code should fail:
    a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)
    a._jrdd.splits().size()
    a.count()
    b = a.partitionBy(5)
    b._jrdd.splits().size()
    b.count()

I figured out from the example that if I generate a key first, like this:

    b = a.map(lambda x: (x, x)).partitionBy(5)

then all is well. In other words, partitionBy() only works on RDDs of tuples, i.e. key-value pairs. Is that correct?
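For what it's worth, here is the workaround written out end to end. This is a minimal sketch, assuming a live SparkContext named sc; it counts partitions with glom() instead of poking at the private _jrdd handle:

    a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

    # partitionBy() hash-partitions by key, so it needs (key, value)
    # pairs; keying each element by itself satisfies that
    b = a.map(lambda x: (x, x)).partitionBy(5)

    # glom() turns each partition into a list, so the length of the
    # collected result is the partition count
    print(len(b.glom().collect()))   # 5

    # strip the synthetic keys afterwards if only the values matter
    print(b.values().count())        # 10

keyBy() looks like it would do the same keying step in one call, though I haven't tried it together with partitionBy().

Nick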