This was added in 1.4.0: https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, "Karlson" <ksonsp...@siberie.de> wrote:

> Hi,
>
> wouldn't df.rdd.partitionBy() return a new RDD that I would then need to
> make into a DataFrame again? Maybe like this:
> df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to
> me, though, and I'm not sure if the DataFrame will be aware of its
> partitioning.
>
> On 2015-05-22 12:55, ayan guha wrote:
>
>> DataFrame is an abstraction over RDD, so you should be able to do
>> df.rdd.partitionBy. However, as far as I know, equijoins already
>> optimize partitioning. You may want to look at the explain plans more
>> carefully and materialise interim joins.
>>
>> On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:
>>
>>> Hi,
>>>
>>> is there any way to control how DataFrames are partitioned? I'm doing
>>> lots of joins and am seeing very large shuffle reads and writes in the
>>> Spark UI. With PairRDDs you can control how the data is partitioned
>>> across nodes with partitionBy. There is no such method on DataFrames,
>>> however. Can I somehow partition the underlying RDD manually? I am
>>> currently using the Python API.
>>>
>>> Thanks!
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org