If you encode the data in a format like Parquet, we usually have more information about its size and will try to broadcast the small side.
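
To make the difference concrete, here is a minimal pure-Python sketch of what a broadcast join does for a query of this shape. It is not Spark code and not how Spark implements it internally; the function names (`aggregate_subquery`, `broadcast_join_count`) and the sample data are hypothetical, chosen only for illustration. The idea is that the small aggregated side is copied to every partition of the large table, so the large table's rows are joined in place rather than shuffled by key.

```python
from collections import Counter

def aggregate_subquery(tab2_col2):
    """Mimics: SELECT col2, count(1) FROM tab2 GROUP BY col2.

    The result has one entry per distinct col2, so it is usually small.
    """
    return Counter(tab2_col2)

def broadcast_join_count(tab1_partitions, small_side):
    """Mimics the outer query with the small side broadcast.

    Every partition of tab1 sees the full small_side map, so each row is
    matched locally; only the small per-partition counts are merged.
    """
    total = Counter()
    for partition in tab1_partitions:
        # Local join + partial aggregation: keep rows whose key exists
        # in the broadcast side, and count them per key.
        local = Counter(col1 for col1 in partition if col1 in small_side)
        total.update(local)
    return dict(total)

# Hypothetical sample data: tab2 keys, and tab1 split into two partitions.
tab2 = ["a", "a", "b", "c"]
tab1_partitions = [["a", "b", "x"], ["a", "a", "y"]]

small = aggregate_subquery(tab2)
result = broadcast_join_count(tab1_partitions, small)
print(result)
```

In Spark itself, the DataFrame API has a `broadcast` function (in `org.apache.spark.sql.functions`) for marking one side of a join as small, and the `spark.sql.autoBroadcastJoinThreshold` setting controls when the planner broadcasts on its own. Check the documentation for your version before relying on either.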
On Thu, Mar 17, 2016 at 7:27 PM, Younes Naguib <younes.nag...@tritondigital.com> wrote:

> Is there any way to cache the subquery or force a broadcast join without
> persisting it?
>
> y
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* March-17-16 8:59 PM
> *To:* Younes Naguib
> *Cc:* user@spark.apache.org
> *Subject:* Re: Subquery performance
>
> Try running EXPLAIN on both versions of the query.
>
> Likely, when you cache the subquery, we know that it's going to be small, so
> we use a broadcast join instead of shuffling the data.
>
> On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <younes.nag...@tritondigital.com> wrote:
>
> > Hi all,
> >
> > I’m running a query that looks like the following:
> >
> > Select col1, count(1)
> > From (Select col2, count(1) from tab2 group by col2)
> > Inner join tab1 on (col1=col2)
> > Group by col1
> >
> > This creates a very large shuffle, 10 times the data size, as if the
> > subquery was executed for each row.
> >
> > Can anything be done to tune this?
> >
> > When the subquery is persisted, it runs much faster, and the shuffle is 50
> > times smaller!
> >
> > *Thanks,*
> > *Younes*