You could try caching the table. This would avoid the double read, but not the shuffle (at least today with the current optimizer).
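For example, a sketch against the 1.3.x API (table and column names taken from your query below; `TAB1_SLICE` is just an illustrative name):

```scala
// Assumes a running HiveContext (Spark 1.3.x).
// Cache TAB1 so the self-join's two scans hit memory instead of HDFS:
hiveContext.cacheTable("TAB1")

// Or cache only the slice the query actually reads, and register it as a
// temporary table so both sides of the self-join reuse the cached data:
val slice = hiveContext.sql(
  """SELECT * FROM TAB1
     WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'""")
slice.cache()
slice.registerTempTable("TAB1_SLICE")
// Then rewrite the query to reference TAB1_SLICE instead of TAB1.
```

Note the cache is only populated on the first action, so the first run still reads HDFS once; subsequent scans are served from memory.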
On Tue, Sep 29, 2015 at 5:21 PM, Data Science Education <datasci...@gmail.com> wrote:

> As part of fairly complex processing, I am executing a self-join query
> using HiveContext against a Hive table to find the latest Transaction,
> oldest Transaction, etc. for a given set of Attributes. I am still using
> v1.3.1, so Window functions are not an option. The simplified query
> looks like this:
>
> val df = hiveContext.sql("""SELECT TAB1.KEY1, TAB1.KEY2,
>   MAX(CASE WHEN (TAB1.FLD10 = TAB2.min_FLD10) THEN TAB1.FLD11
>       ELSE -9999999 END) AS NEW_FLD
> FROM TAB1
> INNER JOIN
>   (SELECT KEY1, KEY2,
>      MIN(FLD10) AS min_FLD10
>    FROM TAB1
>    WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
>    GROUP BY KEY1, KEY2) TAB2
>   ON TAB1.KEY1 = TAB2.KEY1 AND TAB1.KEY2 = TAB2.KEY2
> WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
> GROUP BY TAB1.KEY1, TAB1.KEY2""")
>
> I see that ~18,000 HDFS blocks are read TWICE, and then the Shuffle
> happens. Is there a way to avoid reading the same blocks TWICE during
> the Map Stage? Is there a way to avoid the Shuffle? Thank you.
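For reference, once an upgrade is possible: Spark 1.4 added window function support to HiveContext, which collapses the self-join into a single scan. A sketch (untested, same table and column names as above; the window partition and the GROUP BY still shuffle, but the data is read only once):

```scala
// Sketch, assuming Spark >= 1.4 (window functions available in HiveContext).
// MIN(FLD10) OVER (PARTITION BY KEY1, KEY2) replaces the self-join:
val df = hiveContext.sql(
  """SELECT KEY1, KEY2,
       MAX(CASE WHEN FLD10 = min_FLD10 THEN FLD11 ELSE -9999999 END) AS NEW_FLD
     FROM (SELECT KEY1, KEY2, FLD10, FLD11,
             MIN(FLD10) OVER (PARTITION BY KEY1, KEY2) AS min_FLD10
           FROM TAB1
           WHERE partition_key >= '2015-01-01'
             AND partition_key < '2015-07-01') t
     GROUP BY KEY1, KEY2""")
```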