This was added in 1.4.0: https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, "Karlson" <ksonsp...@siberie.de> wrote:

> Hi,
>
> wouldn't df.rdd.partitionBy() return a new RDD that I would then need to
> make into a DataFrame again? Maybe like this:
> df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to
> me, though, and I'm not sure if the DataFrame will be aware of its
> partitioning.
>
> On 2015-05-22 12:55, ayan guha wrote:
>
>> DataFrame is an abstraction over RDD, so you should be able to do
>> df.rdd.partitionBy. However, as far as I know, equijoins already
>> optimize partitioning. You may want to look at the explain plans more
>> carefully and materialise interim joins.
>>
>> On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:
>>
>>> Hi,
>>>
>>> is there any way to control how DataFrames are partitioned? I'm doing
>>> lots of joins and am seeing very large shuffle reads and writes in the
>>> Spark UI. With PairRDDs you can control how the data is partitioned
>>> across nodes with partitionBy. There is no such method on DataFrames,
>>> however. Can I somehow partition the underlying RDD manually? I am
>>> currently using the Python API.
>>>
>>> Thanks!
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org