Hm, yeah, the docs are not clear on this one. The function you're looking for to change the number of partitions on any ol' RDD is "repartition()", which is available in master but for some reason doesn't seem to show up in the latest docs. Sorry about that. I also hadn't realized from the Python docs that partitionBy() had this behavior, though it is consistent with the Scala API, which just makes it more obvious through the types.
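For intuition, here's a plain-Python sketch (no Spark needed) of why partitionBy() wants (key, value) pairs: a hash partitioner assigns each element to a partition by hashing the element's *key*, so there has to be a key to unpack and hash. The built-in hash() below stands in for PySpark's actual partitioning hash; the function name is just for illustration.

```python
def partition_by_key(pairs, num_partitions):
    """Bucket (key, value) pairs into num_partitions lists by hash(key).

    Feeding this bare values like [1, 2, 3] fails at the tuple
    unpacking step, analogous to partitionBy() on a non-pair RDD.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:  # requires 2-tuples, like a pair RDD
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Mimics b = a.map(lambda x: (x, x)).partitionBy(5) from the thread.
data = [(x, x) for x in range(1, 11)]
parts = partition_by_key(data, 5)
```

repartition(), by contrast, just reshuffles elements to a new partition count without hashing keys, which is why it works on any RDD.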
On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Just an FYI, it's not obvious from the docs
> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
> that the following code should fail:
>
>     a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>     a._jrdd.splits().size()
>     a.count()
>     b = a.partitionBy(5)
>     b._jrdd.splits().size()
>     b.count()
>
> I figured out from the example that if I generated a key by doing this
>
>     b = a.map(lambda x: (x, x)).partitionBy(5)
>
> then all would be well.
>
> In other words, partitionBy() only works on RDDs of tuples. Is that correct?
>
> Nick
>
> ------------------------------
> View this message in context: PySpark RDD.partitionBy() requires an RDD of tuples
> <http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.