Hi,

Thanks all,

I checked both approaches, and grouping sets worked better for me because
I didn't want to cache the data: I am assigning a large fraction of
memory to the shuffle. However, I could only use grouping sets through
HiveContext. I am on Spark 1.5, and I believe SQLContext does not have
this functionality, so anyone who wants to use SQLContext will need to
stick with the cache option.
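
In case it helps anyone, here is a minimal sketch of the GROUPING SETS
query I ended up with, assuming a Hive-registered table named "table"
and the column names from this thread:

import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext named sc; "table" is the placeholder
// table name used in this thread.
val hiveContext = new HiveContext(sc)

// One scan of the table computes all four (col1, colN) aggregates.
// For each grouping set, the columns not in that set come back as NULL.
val grouped = hiveContext.sql("""
  SELECT col1, col2, col3, col4, col5, count(*) AS cnt
  FROM table
  GROUP BY col1, col2, col3, col4, col5
  GROUPING SETS ((col1, col2), (col1, col3), (col1, col4), (col1, col5))
""")

The rows belonging to each grouping set can then be told apart by
checking which columns are NULL (or via the GROUPING__ID virtual column,
if needed).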


Thanks

On Sun, Feb 19, 2017 at 3:02 AM, Yong Zhang <java8...@hotmail.com> wrote:

> If you only need the group-bys within the same hierarchy, you can group
> by at the lowest level and cache that result, then derive the higher
> levels from the cached DF. Spark will scan the original table only once
> and reuse the cache for everything that follows.
>
>
> // Name the count so the higher levels can re-aggregate it.
> val df_base = sqlContext.sql("select col1, col2, col3, col4, col5, count(*) as cnt " +
>   "from table group by col1, col2, col3, col4, col5").cache
>
> df_base.registerTempTable("df_base")
>
> // Sum the pre-aggregated counts; count(*) here would count distinct
> // 5-column combinations instead of the original rows.
> val df1 = sqlContext.sql("select col1, col2, sum(cnt) as cnt from df_base group by col1, col2")
>
> val df2 = // similar logic, sketched below
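>
> A sketch of how that might look, untested (unionAll in Spark 1.x
> matches columns by position, so the result takes df1's column names):
>
> val df2 = sqlContext.sql("select col1, col3, sum(cnt) as cnt from df_base group by col1, col3")
> val df3 = sqlContext.sql("select col1, col4, sum(cnt) as cnt from df_base group by col1, col4")
> val df4 = sqlContext.sql("select col1, col5, sum(cnt) as cnt from df_base group by col1, col5")
>
> // Positional union of the four per-pair aggregates.
> val result = df1.unionAll(df2).unionAll(df3).unionAll(df4)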
>
> Yong
> ------------------------------
> *From:* Patrick <titlibat...@gmail.com>
> *Sent:* Saturday, February 18, 2017 4:23 PM
> *To:* user
> *Subject:* Efficient Spark-Sql queries when only nth Column changes
>
> Hi,
>
> I have read 5 columns from Parquet into a DataFrame. My queries on the
> Parquet table are of the type below:
>
> val df1 = sqlContext.sql("select col1, col2, count(*) from table group by col1, col2")
> val df2 = sqlContext.sql("select col1, col3, count(*) from table group by col1, col3")
> val df3 = sqlContext.sql("select col1, col4, count(*) from table group by col1, col4")
> val df4 = sqlContext.sql("select col1, col5, count(*) from table group by col1, col5")
>
> I then need to union the results from df1 to df4 into a single DataFrame.
>
>
> So basically, only the second column changes. Is there an efficient way
> to write the above queries in Spark SQL instead of writing four separate
> queries (or a loop) and unioning the results?
>
>
> Thanks
>
