Hi Gopal, Thanks for all the information and suggestion. The Hive version is 2.0.1 and use Hive-on-MR as the execution engine.
I think I should create a intermediate table which includes all the dimensions (including the serval kinds of ids), and then use spark-sql to calculate the distinct values separately (spark sql is really fast so ~~). thanks again. 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gop...@apache.org>: > > COUNT(DISTINCT monthly_user_id) AS monthly_active_users, > > COUNT(DISTINCT weekly_user_id) AS weekly_active_users, > … > > GROUPING_ID() AS gid, > > COUNT(1) AS dummy > > There are two things which prevent Hive from optimize multiple count > distincts. > > Another aggregate like a count(1) or a Grouping sets like a ROLLUP/CUBE. > > The multiple count distincts are rewritten into a ROLLUP internally by the > CBO. > > https://issues.apache.org/jira/browse/HIVE-10901 > > A single count distinct + other aggregates (like > min,max,count,count_distinct in 1 pass) is fixed via > > https://issues.apache.org/jira/browse/HIVE-16654 > > There's no optimizer rule to combine both those scenarios. > > https://issues.apache.org/jira/browse/HIVE-15045 > > There's a possibility that you're using Hive-1.x release branch the CBO > doesn't kick in unless column stats are present, but in the Hive-2.x series > you'll notice that some of these optimizations are not driven by a cost > function and are always applied if CBO is enabled. > > > is there any way to rewrite it to optimize the memory usage. > > If you want it to run through very slowly without errors, you can try > disabling all in-memory aggregations. > > set hive.map.aggr=false; > > Cheers, > Gopal > > > -- 不学习,不知道