Thanks for your help. I tried caching the lookup tables and doing a left outer join with the big table (DF). The join still does not use a broadcast join; it goes with a shuffled hash join and shuffles the big table. Here is the scenario:
… table1 as big_df left outer join table2 as lkup on big_df.lkupid = lkup.lkupid

table1 above is well distributed across all 40 partitions because of sqlContext.sql("SET spark.sql.shuffle.partitions=40"). table2 is small, using just 2 partitions.

After the join stage, the Spark UI showed me that all activity ended up in just 2 executors. When I tried to dump the data to HDFS after the join stage, all data ended up in 2 partition files and the remaining 38 files were 0-sized.

Since the above did not work, I tried to broadcast the DF and register it as a table before the join:

val table2_df = sqlContext.sql("select * from table2")
val broadcast_table2 = sc.broadcast(table2_df)
broadcast_table2.value.registerTempTable("table2")

The broadcast attempt has the same issue as explained above: all data is processed by just 2 executors due to lookup skew.

Any more ideas to tackle this issue with Spark DataFrames?

Thanks
Vijay

> On Aug 14, 2015, at 10:27 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>
> You could cache the lookup DataFrames, it’ll then do a broadcast join.
>
> On 8/14/15, 9:39 AM, "VIJAYAKUMAR JAWAHARLAL" <sparkh...@data2o.io> wrote:
>
>> Hi
>>
>> I am facing a huge performance problem when I try to left outer join a
>> very big data set (~140GB) with a bunch of small lookups [star schema type].
>> I am using DataFrames in Spark SQL. It looks like the data is shuffled and
>> skewed when that join happens. Is there any way to improve the performance of
>> such a join in Spark?
>>
>> How can I hint the optimizer to go with a replicated join etc., to avoid the shuffle?
>> Would it help to create broadcast variables on the small lookups? If I create
>> broadcast variables, how can I convert them into DataFrames and use them in a
>> Spark SQL type of join?
>>
>> Thanks
>> Vijay
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
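
[Editor's note: a sketch of the broadcast-join approaches discussed above, assuming a Spark 1.5+ SQLContext; table and column names follow the thread. Note that sc.broadcast(table2_df) only ships the lazy DataFrame plan object to the executors, not its rows, so the join is still planned as a shuffled join.]

```scala
import org.apache.spark.sql.functions.broadcast

val big_df = sqlContext.table("table1")
val lkup   = sqlContext.table("table2")

// Option 1: explicit broadcast hint on the small side of the join,
// forcing a map-side (replicated) join with no shuffle of big_df
val joined = big_df.join(
  broadcast(lkup),
  big_df("lkupid") === lkup("lkupid"),
  "left_outer")

// Option 2: raise the auto-broadcast threshold (in bytes) so the
// optimizer broadcasts table2 on its own; this relies on the optimizer
// knowing the table's size (e.g. via Hive statistics)
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600") // ~100 MB
```

Either way, the small table is replicated to every executor, so the output partitioning follows big_df's 40 partitions rather than collapsing onto the 2 executors holding the skewed lookup keys.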