Hi,

Thanks all. I checked both approaches, and grouping sets worked better for me, because I didn't want to cache the DataFrame: I am already allocating a large fraction of memory to the shuffle. However, I could only use grouping sets through HiveContext. I am on Spark 1.5, and I don't think SQLContext has this functionality, so anyone who has to use SQLContext will need to stick with the cache option.

Thanks

On Sun, Feb 19, 2017 at 3:02 AM, Yong Zhang <java8...@hotmail.com> wrote:

> If you only need the group-bys in the same hierarchy logic, then you can
> group by at the lowest level and cache it, then use the cached DF to
> derive the higher levels, so Spark will only scan the original table
> once and reuse the cache for the rest.
>
> val df_base = sqlContext.sql("select col1, col2, col3, col4, col5, count(*)
>   from table group by col1, col2, col3, col4, col5").cache
>
> df_base.registerTempTable("df_base")
>
> val df1 = sqlContext.sql("select col1, col2, count(*) from df_base group
>   by col1, col2")
>
> val df2 = // similar logic
>
> Yong
> ------------------------------
> *From:* Patrick <titlibat...@gmail.com>
> *Sent:* Saturday, February 18, 2017 4:23 PM
> *To:* user
> *Subject:* Efficient Spark-Sql queries when only nth Column changes
>
> Hi,
>
> I have read 5 columns from parquet into a data frame. My queries on the
> parquet table are of the type below:
>
> val df1 = sqlContext.sql("select col1, col2, count(*) from table group by col1, col2")
> val df2 = sqlContext.sql("select col1, col3, count(*) from table group by col1, col3")
> val df3 = sqlContext.sql("select col1, col4, count(*) from table group by col1, col4")
> val df4 = sqlContext.sql("select col1, col5, count(*) from table group by col1, col5")
>
> And then I need to union the results from df1 to df4 into a single df.
>
> So basically, only the second column changes. Is there an efficient way
> to write the above queries in Spark-Sql instead of writing 4 different
> queries (or a loop) and doing a union to get the result?
>
> Thanks
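For anyone searching this thread later: the grouping-sets query I ended up with was along the lines of `SELECT col1, col2, col3, col4, col5, count(*) FROM table GROUP BY col1, col2, col3, col4, col5 GROUPING SETS ((col1, col2), (col1, col3), (col1, col4), (col1, col5))` run through HiveContext, which scans the table once and produces all four groupings in one result (the unused columns come back as NULL). The exact query shape here is my reconstruction, not copied from my job. As a sanity check of the semantics, here is a small plain-Python simulation (not Spark code) of what GROUPING SETS computes:

```python
from collections import Counter

# Toy rows standing in for the parquet table: (col1, col2, col3).
rows = [
    ("a", "x", 1),
    ("a", "x", 2),
    ("a", "y", 1),
    ("b", "x", 1),
]

# GROUPING SETS ((col1, col2), (col1, col3)) semantics: each input row is
# counted once per grouping set, with columns outside the set replaced by
# None (Hive emits NULL there).
grouping_sets = [(0, 1), (0, 2)]

counts = Counter()
for row in rows:
    for gs in grouping_sets:
        key = tuple(row[i] if i in gs else None for i in range(len(row)))
        counts[key] += 1

# Grouping set (col1, col2): col3 slot is None.
print(counts[("a", "x", None)])  # 2
# Grouping set (col1, col3): col2 slot is None.
print(counts[("a", None, 1)])    # 2
```

The None/NULL slot is how a single unioned result distinguishes which grouping set a row came from, which is exactly the shape of the union of df1..df4 in the original question.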