Re: How to optimize multiple count( distinct col) in Hive SQL

Gopal Vijayaraghavan Tue, 22 Aug 2017 21:57:24 -0700

> COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
> COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
…
> GROUPING_ID() AS gid,
> COUNT(1) AS dummy


There are two things which prevent Hive from optimize multiple count distincts.

Another aggregate like a count(1) or a Grouping sets like a ROLLUP/CUBE.

The multiple count distincts are rewritten into a ROLLUP internally by the CBO.

https://issues.apache.org/jira/browse/HIVE-10901

A single count distinct + other aggregates (like min,max,count,count_distinct 
in 1 pass) is fixed via 

https://issues.apache.org/jira/browse/HIVE-16654

There's no optimizer rule to combine both those scenarios.

https://issues.apache.org/jira/browse/HIVE-15045

There's a possibility that you're using Hive-1.x release branch the CBO doesn't 
kick in unless column stats are present, but in the Hive-2.x series you'll 
notice that some of these optimizations are not driven by a cost function and 
are always applied if CBO is enabled.

> is there any way to rewrite it to optimize the memory usage.

If you want it to run through very slowly without errors, you can try disabling 
all in-memory aggregations.

set hive.map.aggr=false;

Cheers,
Gopal

Re: How to optimize multiple count( distinct col) in Hive SQL

Reply via email to