Hi Dimitris,

I believe the methods partitionBy <https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD.partitionBy> and mapPartitions <https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD.mapPartitions> are specific to RDDs, while you're talking about DataFrames <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame>. You have a few options, including the two below (rough, untested sketches of each follow the list):

1. Use the DataFrame.rdd <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.rdd> call and process the returned RDD. Please note that the return type of this call is an RDD of Row.
2. Use groupBy <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy> on the DataFrame and start from there; this may involve defining a UDF or leveraging the existing GroupedData <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData> functions.
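For option 1, a minimal sketch. Note that partitionBy only works on a pair RDD, hence the keyBy step, and the default partitioner hashes keys, so distinct Group_Id values can still land in the same partition; pass your own partitionFunc to partitionBy if you need strict one-group-per-partition behavior. process_partition and the sample data are just placeholders for your own logic:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "id1", "Point1"), (2, "id2", "Point2")],
        ["Group_Id", "Id", "Points"],
    )

    def process_partition(rows):
        # rows is an iterator over the Row objects of one partition;
        # replace this with your own per-partition logic
        for row in rows:
            yield (row.Group_Id, row.Id, row.Points)

    num_groups = df.select("Group_Id").distinct().count()

    result = (
        df.rdd                              # RDD of Row, as noted above
          .keyBy(lambda row: row.Group_Id)  # partitionBy needs a pair RDD
          .partitionBy(num_groups)          # hash-partition by Group_Id
          .values()                         # back to plain Row objects
          .mapPartitions(process_partition)
    )
    print(result.collect())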
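For option 2, a sketch that stays in the DataFrame API: gather each group's rows with collect_list and apply a UDF to the collected values. The summarize UDF here is a made-up placeholder for your function. Be aware that collect_list pulls an entire group into a single row, so very large groups can cause memory pressure:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "id1", "Point1"), (2, "id2", "Point2")],
        ["Group_Id", "Id", "Points"],
    )

    # hypothetical per-group function: just joins the points into a string
    summarize = F.udf(lambda points: ",".join(points), StringType())

    result = (
        df.groupBy("Group_Id")
          .agg(F.collect_list("Points").alias("points"))  # one row per group
          .withColumn("summary", summarize("points"))     # apply custom function
    )
    result.show()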
It really depends on your use-case and your performance requirements.

HTH

On Sun, Sep 30, 2018 at 8:31 PM dimitris plakas <dimitrisp...@gmail.com> wrote:

> Hello everyone,
>
> I am trying to split a dataframe into partitions and I want to apply a
> custom function to every partition. More precisely, I have a dataframe
> like the one below:
>
> Group_Id | Id  | Points
> 1        | id1 | Point1
> 2        | id2 | Point2
>
> I want to have a partition for every Group_Id and to apply a function
> defined by me on every partition.
> I have tried partitionBy('Group_Id').mapPartitions() but I receive an
> error.
> Could you please advise me how to do it?