Looking at python/pyspark/sql/dataframe.py:

    @since(1.4)
    def coalesce(self, numPartitions):

    @since(1.3)
    def repartition(self, numPartitions):
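A minimal sketch of how they would be called (the data, column names and partition counts here are made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="repartition-sketch")
    sqlContext = SQLContext(sc)

    # Made-up toy data; 'key' is a hypothetical join column.
    df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["key", "val1"])
    df2 = sqlContext.createDataFrame([(1, "x"), (2, "y")], ["key", "val2"])

    # repartition(numPartitions) does a full shuffle into that many partitions.
    joined = df1.repartition(4).join(df2, df1.key == df2.key)

    # coalesce(numPartitions) merges partitions without a full shuffle.
    result = joined.coalesce(1)
    print(result.rdd.getNumPartitions())  # 1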
Would the above methods serve the purpose?

Cheers

On Fri, May 22, 2015 at 6:57 AM, Karlson <ksonsp...@siberie.de> wrote:

> Alright, that doesn't seem to have made it into the Python API yet.
>
> On 2015-05-22 15:12, Silvio Fiorito wrote:
>
>> This was added in 1.4.0:
>>
>> https://github.com/apache/spark/pull/5762
>>
>> On 5/22/15, 8:48 AM, "Karlson" <ksonsp...@siberie.de> wrote:
>>
>>> Hi,
>>>
>>> wouldn't df.rdd.partitionBy() return a new RDD that I would then need
>>> to make into a DataFrame again? Maybe like this:
>>> df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird
>>> to me, though, and I'm not sure whether the DataFrame will be aware of
>>> its partitioning.
>>>
>>> On 2015-05-22 12:55, ayan guha wrote:
>>>
>>>> A DataFrame is an abstraction over an RDD, so you should be able to
>>>> do df.rdd.partitionBy. However, as far as I know, equijoins already
>>>> optimize partitioning. You may want to look at the explain plans more
>>>> carefully and materialise interim joins.
>>>>
>>>> On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> is there any way to control how DataFrames are partitioned? I'm
>>>>> doing lots of joins and am seeing very large shuffle reads and
>>>>> writes in the Spark UI. With PairRDDs you can control how the data
>>>>> is partitioned across nodes with partitionBy. There is no such
>>>>> method on DataFrames, however. Can I somehow partition the
>>>>> underlying RDD manually? I am currently using the Python API.
>>>>>
>>>>> Thanks!
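Regarding the df.rdd.partitionBy(...).toDF(schema=df.schema) round trip discussed above, a rough sketch of what that would look like ('key' and the data are, again, made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitionby-roundtrip")
    sqlContext = SQLContext(sc)

    df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["key", "val"])

    # partitionBy only exists on pair RDDs, so key each Row first,
    # hash-partition, then drop the keys and rebuild the DataFrame.
    pairs = df.rdd.map(lambda row: (row.key, row))
    df2 = pairs.partitionBy(8).map(lambda kv: kv[1]).toDF(schema=df.schema)

    print(df2.rdd.getNumPartitions())  # should be 8

The caveat is that Catalyst does not track the RDD's partitioner through this round trip, so a subsequent join on 'key' may still plan a shuffle, which matches the concern above that the DataFrame won't be aware of its partitioning.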