> COUNT(DISTINCT monthly_user_id) AS monthly_active_users, > COUNT(DISTINCT weekly_user_id) AS weekly_active_users, … > GROUPING_ID() AS gid, > COUNT(1) AS dummy
There are two things which prevent Hive from optimize multiple count distincts. Another aggregate like a count(1) or a Grouping sets like a ROLLUP/CUBE. The multiple count distincts are rewritten into a ROLLUP internally by the CBO. https://issues.apache.org/jira/browse/HIVE-10901 A single count distinct + other aggregates (like min,max,count,count_distinct in 1 pass) is fixed via https://issues.apache.org/jira/browse/HIVE-16654 There's no optimizer rule to combine both those scenarios. https://issues.apache.org/jira/browse/HIVE-15045 There's a possibility that you're using Hive-1.x release branch the CBO doesn't kick in unless column stats are present, but in the Hive-2.x series you'll notice that some of these optimizations are not driven by a cost function and are always applied if CBO is enabled. > is there any way to rewrite it to optimize the memory usage. If you want it to run through very slowly without errors, you can try disabling all in-memory aggregations. set hive.map.aggr=false; Cheers, Gopal