Hi Dimitris,

I believe the methods partitionBy <https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD.partitionBy> and mapPartitions <https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD.mapPartitions> are specific to RDDs, while you're talking about DataFrames <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame>. You have a few options, including the two below (rough, untested sketches of each follow the list):

1. Use the DataFrame.rdd <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.rdd> call and process the returned RDD. Please note that the return type of this call is an RDD of Row.
2. Use groupBy <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy> on the DataFrame and start from there; this may involve defining a UDF or leveraging the existing GroupedData <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData> functions.
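For option 1, a minimal sketch. Note that partitionBy only works on a pair RDD, hence the keyBy step, and the default partitioner hashes keys, so distinct Group_Id values can still land in the same partition; pass your own partitionFunc to partitionBy if you need strict one-group-per-partition behavior. process_partition and the sample data are just placeholders for your own logic:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "id1", "Point1"), (2, "id2", "Point2")],
        ["Group_Id", "Id", "Points"],
    )

    def process_partition(rows):
        # rows is an iterator over the Row objects of one partition;
        # replace this with your own per-partition logic
        for row in rows:
            yield (row.Group_Id, row.Id, row.Points)

    num_groups = df.select("Group_Id").distinct().count()

    result = (
        df.rdd                              # RDD of Row, as noted above
          .keyBy(lambda row: row.Group_Id)  # partitionBy needs a pair RDD
          .partitionBy(num_groups)          # hash-partition by Group_Id
          .values()                         # back to plain Row objects
          .mapPartitions(process_partition)
    )
    print(result.collect())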
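For option 2, a sketch that stays in the DataFrame API: gather each group's rows with collect_list and apply a UDF to the collected values. The summarize UDF here is a made-up placeholder for your function. Be aware that collect_list pulls an entire group into a single row, so very large groups can cause memory pressure:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "id1", "Point1"), (2, "id2", "Point2")],
        ["Group_Id", "Id", "Points"],
    )

    # hypothetical per-group function: just joins the points into a string
    summarize = F.udf(lambda points: ",".join(points), StringType())

    result = (
        df.groupBy("Group_Id")
          .agg(F.collect_list("Points").alias("points"))  # one row per group
          .withColumn("summary", summarize("points"))     # apply custom function
    )
    result.show()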
It really depends on your use-case and your performance requirements.

HTH

On Sun, Sep 30, 2018 at 8:31 PM dimitris plakas <dimitrisp...@gmail.com> wrote:

> Hello everyone,
>
> I am trying to split a dataframe into partitions and I want to apply a
> custom function to every partition. More precisely, I have a dataframe
> like the one below:
>
> Group_Id | Id  | Points
> 1        | id1 | Point1
> 2        | id2 | Point2
>
> I want to have a partition for every Group_Id and to apply a function
> defined by me on every partition.
> I have tried partitionBy('Group_Id').mapPartitions() but I receive an
> error.
> Could you please advise me how to do it?