Re: How to optimize multiple count( distinct col) in Hive SQL

panfei Tue, 22 Aug 2017 22:11:03 -0700

Hi Gopal, Thanks for all the information and suggestion.

The Hive version is 2.0.1 and use Hive-on-MR as the execution engine.


I think I should create a intermediate table which includes all the
dimensions (including the serval kinds of ids), and then use spark-sql to
calculate the distinct values separately (spark sql is really fast so ~~).

thanks again.

2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gop...@apache.org>:

> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
> …
> > GROUPING_ID() AS gid,
> > COUNT(1) AS dummy
>
> There are two things which prevent Hive from optimize multiple count
> distincts.
>
> Another aggregate like a count(1) or a Grouping sets like a ROLLUP/CUBE.
>
> The multiple count distincts are rewritten into a ROLLUP internally by the
> CBO.
>
> https://issues.apache.org/jira/browse/HIVE-10901
>
> A single count distinct + other aggregates (like
> min,max,count,count_distinct in 1 pass) is fixed via
>
> https://issues.apache.org/jira/browse/HIVE-16654
>
> There's no optimizer rule to combine both those scenarios.
>
> https://issues.apache.org/jira/browse/HIVE-15045
>
> There's a possibility that you're using Hive-1.x release branch the CBO
> doesn't kick in unless column stats are present, but in the Hive-2.x series
> you'll notice that some of these optimizations are not driven by a cost
> function and are always applied if CBO is enabled.
>
> > is there any way to rewrite it to optimize the memory usage.
>
> If you want it to run through very slowly without errors, you can try
> disabling all in-memory aggregations.
>
> set hive.map.aggr=false;
>
> Cheers,
> Gopal
>
>
>


-- 
不学习，不知道

Re: How to optimize multiple count( distinct col) in Hive SQL

Reply via email to