Re: Pyspark Partitioning

2018-10-04 Thread Vitaliy Pisarev
Groupby is an operator you would use if you wanted to *aggregate* the
values that are grouped by the specified key.

In your case you want to retain access to the values.

You need to do df.rdd.partitionBy and then you can map the partitions. Of
course you need to be careful of potential skew in the resulting
partitions.
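
A minimal sketch of that, assuming Group_Id runs from 1 to 7 and that
custom_func consumes an iterator over the rows of one partition (the body
below is a placeholder, not your actual function):

    # key the rows by Group_Id, force one partition per group,
    # then run custom_func over each partition
    def custom_func(rows):
        # rows is an iterator over the (Group_Id, Row) pairs of one partition
        for group_id, row in rows:
            yield (group_id, row.Object_Id)  # placeholder logic

    pair_rdd = data_df.rdd.map(lambda row: (row.Group_Id, row))

    # Group_Ids 1..7 land in 7 distinct buckets under key % 7
    partitioned_rdd = pair_rdd.partitionBy(7, lambda key: key % 7)

    result_rdd = partitioned_rdd.mapPartitions(custom_func)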

On Thu, Oct 4, 2018, 23:27 dimitris plakas  wrote:

> Hello everyone,
>
> Here is an issue that I am facing in partitioning a dataframe.
>
> I have a dataframe called data_df. It looks like:
>
> Group_Id | Object_Id | Trajectory
>        1 | obj1      | Traj1
>        2 | obj2      | Traj2
>        1 | obj3      | Traj3
>        3 | obj4      | Traj4
>        2 | obj5      | Traj5
>
> This dataframe has 5045 rows, where each row has a Group_Id value from 1
> to 7, and the number of rows per group_id is arbitrary.
> I want to split the RDD produced from this dataframe into 7 partitions,
> one for each group_id, and then apply mapPartitions(), where I call a
> function custom_func(). How can I create these partitions from this
> dataframe? Should I first apply a group by (creating grouped_df) in order
> to get a dataframe with 7 rows, and then call
> partitioned_rdd = grouped_df.rdd.mapPartitions()?
> Which is the optimal way to do it?
>
> Thank you in advance
>


Pyspark Partitioning

2018-10-04 Thread dimitris plakas
Hello everyone,

Here is an issue that I am facing in partitioning a dataframe.

I have a dataframe called data_df. It looks like:

Group_Id | Object_Id | Trajectory
       1 | obj1      | Traj1
       2 | obj2      | Traj2
       1 | obj3      | Traj3
       3 | obj4      | Traj4
       2 | obj5      | Traj5

This dataframe has 5045 rows, where each row has a Group_Id value from 1 to
7, and the number of rows per group_id is arbitrary.
I want to split the RDD produced from this dataframe into 7 partitions, one
for each group_id, and then apply mapPartitions(), where I call a function
custom_func(). How can I create these partitions from this dataframe?
Should I first apply a group by (creating grouped_df) in order to get a
dataframe with 7 rows, and then call
partitioned_rdd = grouped_df.rdd.mapPartitions()?
Which is the optimal way to do it?

Thank you in advance


Re: Pyspark Partitioning

2018-10-01 Thread Gourav Sengupta
Hi,

the simplest option is to create UDFs of these different functions and
then use a case statement (or similar) in SQL and pass it on. But this is
low-tech; in case you have conditions based on record values which are even
more granular, why not use a single UDF and let the conditions handle it?

But I think that UDF performance is not that great unless you use Scala.
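
For illustration, a sketch of the single-UDF idea (the branching logic and
the output column are made up for the example):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def process(group_id, trajectory):
        # let the conditions inside the UDF do the per-group branching
        if group_id == 1:
            return trajectory.upper()  # placeholder logic
        elif group_id == 2:
            return trajectory.lower()  # placeholder logic
        return trajectory

    process_udf = udf(process, StringType())

    result_df = data_df.withColumn(
        "Processed", process_udf("Group_Id", "Trajectory"))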

It will be interesting to see if there are other scalable options (which
are not RDD based) from the group.

Regards,
Gourav Sengupta

On Sun, Sep 30, 2018 at 7:31 PM dimitris plakas 
wrote:

> Hello everyone,
>
> I am trying to split a dataframe into partitions, and I want to apply a
> custom function on every partition. More precisely, I have a dataframe like
> the one below:
>
> Group_Id | Id  | Points
>        1 | id1 | Point1
>        2 | id2 | Point2
>
> I want to have a partition for every Group_Id and apply a function
> defined by me on every partition.
> I have tried partitionBy('Group_Id').mapPartitions(), but I receive an
> error.
> Could you please advise me how to do it?
>


Re: Pyspark Partitioning

2018-09-30 Thread ayan guha
Hi

There is a set of functions that can be used with the construct
OVER (PARTITION BY col ORDER BY col).

Search for rank and the other window functions in the Spark documentation.
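
For example, a minimal sketch (rank over an assumed ordering column):

    from pyspark.sql import Window
    from pyspark.sql.functions import rank

    # the equivalent of OVER (PARTITION BY Group_Id ORDER BY Object_Id)
    w = Window.partitionBy("Group_Id").orderBy("Object_Id")

    ranked_df = data_df.withColumn("rank", rank().over(w))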

On Mon, 1 Oct 2018 at 5:29 am, Riccardo Ferrari  wrote:

> Hi Dimitris,
>
> I believe the methods partitionBy and mapPartitions are specific to RDDs,
> while you're talking about DataFrames.
> I guess you have a few options, including:
> 1. Use the Dataframe.rdd call and process the returned RDD. Please note
> the return type for this call is an RDD of Row.
> 2. Use the groupBy from DataFrames and start from there; this may involve
> defining a UDF or leveraging the existing GroupedData functions.
>
> It really depends on your use-case and your performance requirements.
> HTH
>
> On Sun, Sep 30, 2018 at 8:31 PM dimitris plakas 
> wrote:
>
>> Hello everyone,
>>
>> I am trying to split a dataframe into partitions, and I want to apply a
>> custom function on every partition. More precisely, I have a dataframe like
>> the one below:
>>
>> Group_Id | Id  | Points
>>        1 | id1 | Point1
>>        2 | id2 | Point2
>>
>> I want to have a partition for every Group_Id and apply a function
>> defined by me on every partition.
>> I have tried partitionBy('Group_Id').mapPartitions(), but I receive an
>> error.
>> Could you please advise me how to do it?
>>
--
Best Regards,
Ayan Guha


Re: Pyspark Partitioning

2018-09-30 Thread Riccardo Ferrari
Hi Dimitris,

I believe the methods partitionBy and mapPartitions are specific to RDDs,
while you're talking about DataFrames.
I guess you have a few options, including:
1. Use the Dataframe.rdd call and process the returned RDD. Please note
the return type for this call is an RDD of Row.
2. Use the groupBy from DataFrames and start from there; this may involve
defining a UDF or leveraging the existing GroupedData functions.
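
A rough sketch of both options, assuming your dataframe is bound to df (the
aggregation in option 2 is only an example):

    from pyspark.sql.functions import collect_list

    # Option 1: drop to the underlying RDD; each element is a pyspark.sql.Row
    by_group = df.rdd.keyBy(lambda row: row.Group_Id)
    # ... repartition and/or mapPartitions over by_group as needed ...

    # Option 2: stay in the DataFrame API and aggregate per group
    grouped_df = df.groupBy("Group_Id").agg(
        collect_list("Points").alias("Points"))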

It really depends on your use-case and your performance requirements.
HTH

On Sun, Sep 30, 2018 at 8:31 PM dimitris plakas 
wrote:

> Hello everyone,
>
> I am trying to split a dataframe into partitions, and I want to apply a
> custom function on every partition. More precisely, I have a dataframe like
> the one below:
>
> Group_Id | Id  | Points
>        1 | id1 | Point1
>        2 | id2 | Point2
>
> I want to have a partition for every Group_Id and apply a function
> defined by me on every partition.
> I have tried partitionBy('Group_Id').mapPartitions(), but I receive an
> error.
> Could you please advise me how to do it?
>


Pyspark Partitioning

2018-09-30 Thread dimitris plakas
Hello everyone,

I am trying to split a dataframe into partitions, and I want to apply a custom
function on every partition. More precisely, I have a dataframe like the one
below:

Group_Id | Id  | Points
       1 | id1 | Point1
       2 | id2 | Point2

I want to have a partition for every Group_Id and apply a function defined
by me on every partition.
I have tried partitionBy('Group_Id').mapPartitions(), but I receive an
error.
Could you please advise me how to do it?