Looking at python/pyspark/sql/dataframe.py:

    @since(1.4)
    def coalesce(self, numPartitions):

    @since(1.3)
    def repartition(self, numPartitions):
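A minimal sketch of how they would be called (the data, column names and partition counts here are made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="repartition-sketch")
    sqlContext = SQLContext(sc)

    # Made-up toy data; 'key' is a hypothetical join column.
    df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["key", "val1"])
    df2 = sqlContext.createDataFrame([(1, "x"), (2, "y")], ["key", "val2"])

    # repartition(numPartitions) does a full shuffle into that many partitions.
    joined = df1.repartition(4).join(df2, df1.key == df2.key)

    # coalesce(numPartitions) merges partitions without a full shuffle.
    result = joined.coalesce(1)
    print(result.rdd.getNumPartitions())  # 1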
Would the above methods serve the purpose?

Cheers

On Fri, May 22, 2015 at 6:57 AM, Karlson <ksonsp...@siberie.de> wrote:

> Alright, that doesn't seem to have made it into the Python API yet.
>
> On 2015-05-22 15:12, Silvio Fiorito wrote:
>
>> This was added in 1.4.0:
>>
>> https://github.com/apache/spark/pull/5762
>>
>> On 5/22/15, 8:48 AM, "Karlson" <ksonsp...@siberie.de> wrote:
>>
>>> Hi,
>>>
>>> wouldn't df.rdd.partitionBy() return a new RDD that I would then need
>>> to make into a DataFrame again? Maybe like this:
>>> df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird
>>> to me, though, and I'm not sure whether the DataFrame will be aware of
>>> its partitioning.
>>>
>>> On 2015-05-22 12:55, ayan guha wrote:
>>>
>>>> A DataFrame is an abstraction over an RDD, so you should be able to
>>>> do df.rdd.partitionBy. However, as far as I know, equijoins already
>>>> optimize partitioning. You may want to look at the explain plans more
>>>> carefully and materialise interim joins.
>>>>
>>>> On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> is there any way to control how DataFrames are partitioned? I'm
>>>>> doing lots of joins and am seeing very large shuffle reads and
>>>>> writes in the Spark UI. With PairRDDs you can control how the data
>>>>> is partitioned across nodes with partitionBy. There is no such
>>>>> method on DataFrames, however. Can I somehow partition the
>>>>> underlying RDD manually? I am currently using the Python API.
>>>>>
>>>>> Thanks!
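Regarding the df.rdd.partitionBy(...).toDF(schema=df.schema) round trip discussed above, a rough sketch of what that would look like ('key' and the data are, again, made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitionby-roundtrip")
    sqlContext = SQLContext(sc)

    df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["key", "val"])

    # partitionBy only exists on pair RDDs, so key each Row first,
    # hash-partition, then drop the keys and rebuild the DataFrame.
    pairs = df.rdd.map(lambda row: (row.key, row))
    df2 = pairs.partitionBy(8).map(lambda kv: kv[1]).toDF(schema=df.schema)

    print(df2.rdd.getNumPartitions())  # should be 8

The caveat is that Catalyst does not track the RDD's partitioner through this round trip, so a subsequent join on 'key' may still plan a shuffle, which matches the concern above that the DataFrame won't be aware of its partitioning.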