Its helpful to always include the output of df.explain(true) when you are asking about performance.
On Mon, Feb 29, 2016 at 6:14 PM, Deepak Gopalakrishnan <dgk...@gmail.com> wrote: > Hello All, > > I'm trying to join 2 dataframes A and B with a > > sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a=B.a"); > > Now what I have done is that I have registeredTempTables for A and B after > loading these DataFrames from different sources. I need the join to be > really fast and I was wondering if there is a way to use the SQL statement > and then being able to do a mapper side join ( say my table B is small) ? > > I read some articles on using broadcast to do mapper side joins. Could I > do something like this and then execute my sql statement to achieve mapper > side join ? > > DataFrame B = sparkContext.broadcast(B); > B.registerTempTable("B"); > > > I have a join as stated above and I see in my executor logs the below : > > 16/02/29 17:02:35 INFO TaskSetManager: Finished task 198.0 in stage 7.0 > (TID 1114) in 20354 ms on localhost (196/200) > > 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty > blocks out of 200 blocks > > 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote > fetches in 0 ms > > 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty > blocks out of 128 blocks > > 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote > fetches in 0 ms > > 16/02/29 17:03:03 INFO Executor: Finished task 199.0 in stage 7.0 (TID > 1115). 2511 bytes result sent to driver > > 16/02/29 17:03:03 INFO TaskSetManager: Finished task 199.0 in stage 7.0 > (TID 1115) in 27621 ms on localhost (197/200) > > *16/02/29 17:07:06 INFO UnsafeExternalSorter: Thread 124 spilling sort > data of 256.0 KB to disk (0 time so far)* > > > Now, I have around 10G of executor memory and my memory faction should be > the default ( 0.75 as per the documentation) and my memory usage is < 1.5G( > obtained from the Storage tab on Spark dashboard), but still it says > spilling sort data. I'm a little surprised why this happens even when I > have enough memory free. > > Any inputs will be greatly appreciated! > > Thanks > -- > Regards, > *Deepak Gopalakrishnan* > *Mobile*:+918891509774 > *Skype* : deepakgk87 > http://myexps.blogspot.com > >