Re: Spark SQL, dataframe join questions.

2017-03-29 Thread vaquar khan
Hi,

I found the following two links helpful and am sharing them with you.

http://stackoverflow.com/questions/38353524/how-to-ensure-partitioning-induced-by-spark-dataframe-join

http://spark.apache.org/docs/latest/configuration.html
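
For join tuning, the settings on that configuration page most directly
relevant are spark.sql.shuffle.partitions and
spark.sql.autoBroadcastJoinThreshold. A minimal sketch; the values shown
are the stock defaults, for illustration only:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-config-sketch")
  .master("local[*]")
  // number of partitions used when shuffling data for joins/aggregations
  .config("spark.sql.shuffle.partitions", "200")
  // tables at or below this size in bytes are broadcast instead of shuffled
  .config("spark.sql.autoBroadcastJoinThreshold", "10485760")
  .getOrCreate()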


Regards,
Vaquar khan



-- 
Regards,
Vaquar Khan
+1-224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Vidya Sujeet
With repartition, every element is moved to a new partition, doing a full
shuffle, compared to the shuffles done by reduceByKey. With this in mind,
repartitioning on the join key could improve your query performance.
reduceByKey will also shuffle, based on the aggregation.

The best way to design this is to check the query plan of your DataFrame
join query and do RDD joins accordingly, if needed.
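
As a minimal sketch of checking the plan in a spark-shell session (the
example DataFrames and the join column "key" are made up for illustration):

import spark.implicits._

// Hypothetical DataFrames standing in for real tables.
val df1 = Seq((1, "a"), (2, "b"), (2, "c")).toDF("key", "v1")
val df2 = Seq((1, "x"), (3, "y")).toDF("key", "v2")

// Let Catalyst plan the join. explain(true) prints the parsed, analyzed,
// optimized, and physical plans, showing any Exchange (shuffle) or
// broadcast the optimizer chose.
df1.join(df2, Seq("key")).explain(true)

// For comparison: explicitly repartition both sides on the join key first.
df1.repartition($"key")
  .join(df2.repartition($"key"), Seq("key"))
  .explain()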




Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Yong Zhang
You don't need to repartition your data just for the join. But if either
side of the join is already partitioned, Spark will use that to its
advantage as part of join optimization.

Whether you should reduceByKey before the join really depends on your join
logic. reduceByKey will shuffle, and the following join COULD cause another
shuffle, so I am not sure it is a smart approach.
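
To illustrate with a rough spark-shell sketch (hypothetical data): when
reduceByKey leaves its result partitioned, the join that follows can reuse
that partitioner, so only the other side is shuffled -- but that is still a
second shuffle.

import org.apache.spark.HashPartitioner

// Hypothetical pair RDDs standing in for real data.
val left  = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val right = sc.parallelize(Seq(("a", 10), ("b", 20)))

// First shuffle: reduceByKey aggregates and leaves `reduced`
// hash-partitioned on the key.
val reduced = left.reduceByKey(new HashPartitioner(4), _ + _)

// The join reuses reduced's partitioner, so only `right` is
// shuffled here -- the possible second shuffle mentioned above.
reduced.join(right).collect().foreach(println)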

Yong



Re: Spark SQL, dataframe join questions.

2017-03-29 Thread shyla deshpande
On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande 
wrote:

> Following are my questions. Thank you.
>
> 1. When joining DataFrames, is it a good idea to repartition on the key
> column that is used in the join, or is the optimizer smart enough that we
> can forget it?
>
> 2. In RDD joins, wherever possible we do reduceByKey before the join to
> avoid a big shuffle of data. Do we need to do anything similar with
> DataFrame joins, or is the optimizer smart enough that we can forget it?