Thanks for your help

I tried caching the lookup tables and then doing the left outer join with the 
big table (DF). The join still does not use a broadcast join; it falls back to 
a shuffled hash join and shuffles the big table. Here is the scenario:


…
table1 as big_df
left outer join
table2 as lkup
on big_df.lkupid = lkup.lkupid

table1 above is well distributed across all 40 partitions because of 
sqlContext.sql("SET spark.sql.shuffle.partitions=40"). table2 is small, using 
just 2 partitions. After the join stage, the Spark UI showed that all activity 
ended up in just 2 executors. When I dumped the data to HDFS after the join 
stage, all of it landed in 2 partition files and the remaining 38 files were 
0-sized.
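For what it's worth, one knob worth checking is spark.sql.autoBroadcastJoinThreshold: 
the planner only broadcasts a table it believes is below that size, and it needs 
size statistics for the table (e.g. from caching it). A sketch, assuming the 
Spark 1.x sqlContext from the snippet above:

```scala
// Sketch: raise the broadcast threshold (bytes; default is 10 MB)
// so a lookup table of up to ~50 MB is eligible for a broadcast join.
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=52428800")

// Cache the lookup so its size statistics are known to the planner.
sqlContext.cacheTable("table2")

// Same join as above; with table2 cached and under the threshold,
// the planner can pick a broadcast join and avoid shuffling table1.
val joined = sqlContext.sql(
  """SELECT *
    |FROM table1 big_df
    |LEFT OUTER JOIN table2 lkup
    |ON big_df.lkupid = lkup.lkupid""".stripMargin)
```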

Since the above did not work, I tried to broadcast the lookup DF and register 
it as a table before the join.

val table2_df = sqlContext.sql("select * from table2")
val broadcast_table2 = sc.broadcast(table2_df)
broadcast_table2.value.registerTempTable("table2")
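(As I understand it, sc.broadcast on a DataFrame only ships the DataFrame 
object itself, i.e. the logical plan, not its rows, so the planner's join 
strategy is unchanged. A sketch of an alternative, assuming Spark 1.5+ where 
the explicit broadcast hint in org.apache.spark.sql.functions exists:)

```scala
import org.apache.spark.sql.functions.broadcast

// Sketch, assuming Spark 1.5+: mark the small side explicitly so the
// planner uses a broadcast (map-side) join regardless of statistics.
val bigDf  = sqlContext.table("table1")
val lkupDf = sqlContext.table("table2")

val joined = bigDf.join(broadcast(lkupDf),
  bigDf("lkupid") === lkupDf("lkupid"), "left_outer")
```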

The broadcast attempt has the same issue as explained above: all data is 
processed by just 2 executors due to skew in the lookup key.
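One common workaround for a skewed join key is salting: spread each hot key on 
the big side over several artificial sub-keys and replicate the lookup side to 
match. The sketch below is illustrative only (bigDf, lkupDf, and numSalts are 
assumed names, not from the tables above):

```scala
import org.apache.spark.sql.functions._

// Sketch of key salting for a skewed left outer join (Spark 1.4+).
val numSalts = 40

// Big side: tag every row with a random salt in [0, numSalts).
val saltedBig = bigDf.withColumn("salt", (rand() * numSalts).cast("int"))

// Lookup side: replicate each row once per salt value so every
// salted key on the big side still finds its match.
val salts = sqlContext.range(0, numSalts)
  .select(col("id").cast("int").as("salt"))
val saltedLkup = lkupDf.join(salts) // cartesian: small table x numSalts rows

// Join on (lkupid, salt); each hot key now spreads across numSalts tasks.
val joined = saltedBig.join(
  saltedLkup,
  saltedBig("lkupid") === saltedLkup("lkupid") &&
    saltedBig("salt") === saltedLkup("salt"),
  "left_outer")
```

The trade-off is replicating the lookup table numSalts times, which is usually 
acceptable when the lookups are small.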

Any more ideas to tackle this issue with Spark DataFrames?

Thanks
Vijay


> On Aug 14, 2015, at 10:27 AM, Silvio Fiorito <silvio.fior...@granturing.com> 
> wrote:
> 
> You could cache the lookup DataFrames, it’ll then do a broadcast join.
> 
> 
> 
> 
> On 8/14/15, 9:39 AM, "VIJAYAKUMAR JAWAHARLAL" <sparkh...@data2o.io> wrote:
> 
>> Hi
>> 
>> I am facing a huge performance problem when I try to left outer join a 
>> very big data set (~140GB) with a bunch of small lookups [star schema type]. 
>> I am using data frames in Spark SQL. It looks like the data is shuffled and 
>> skewed when that join happens. Is there any way to improve the performance 
>> of this type of join in Spark? 
>> 
>> How can I hint the optimizer to go with a replicated join etc., to avoid 
>> the shuffle? Would it help to create broadcast variables on the small 
>> lookups? If I create broadcast variables, how can I convert them into data 
>> frames and use them in a Spark SQL type of join?
>> 
>> Thanks
>> Vijay
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 

