Re: How to optimize multiple count( distinct col) in Hive SQL

2017-09-03 Thread panfei
As the amount of data grew larger and larger, the same OOM occurred again, so we changed hive.exec.reducers.bytes.per.reducer from 256MB to 64MB, and everything has gone well since then. So the root cause of the issue is that one reducer cannot process so much data in a single round. Hope it helps. 2017-08-
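
For anyone who wants to try the same change, here is a minimal sketch of the setting described above. It assumes the override is applied per session (Hive CLI or Beeline) rather than in hive-site.xml; the message does not say which was used. The value is given in bytes.

```sql
-- Assumption: session-level override issued before the query runs.
-- 64 MB = 64 * 1024 * 1024 = 67108864 bytes (previous value: 256 MB = 268435456).
SET hive.exec.reducers.bytes.per.reducer=67108864;

-- Issuing SET with only the property name prints the value currently in effect.
SET hive.exec.reducers.bytes.per.reducer;
```

Since Hive estimates the reducer count from the input size divided by this value (capped by hive.exec.reducers.max), going from 256MB to 64MB should give roughly four times as many reducers, each handling about a quarter of the data.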

Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-23 Thread panfei
By decreasing mapreduce.reduce.shuffle.parallelcopies from 20 to 5, it seems that everything goes well, with no OOM. 2017-08-23 17:19 GMT+08:00 panfei : > The full error stack is (which described here : https://issues.apache.org/jira/browse/MAPREDUCE-6108) : > > this error can not reproduce ever
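
A minimal sketch of the change described above, assuming it is applied per session from Hive so that it is passed through to the MapReduce job configuration. The message does not say whether it was set this way or in mapred-site.xml.

```sql
-- Assumption: session-level override; the cluster value was 20 according to the message.
-- Fewer parallel copier threads means fewer map outputs being fetched at the same time
-- during the reduce-side shuffle.
SET mapreduce.reduce.shuffle.parallelcopies=5;
```

The trade-off is a slower shuffle in exchange for lower memory pressure on the reducer.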

Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-23 Thread panfei
The full error stack is below (the error is described here: https://issues.apache.org/jira/browse/MAPREDUCE-6108). This error cannot be reproduced every time; after retrying several times, the job finished successfully. 2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running chil

Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread panfei
Hi Gopal, Thanks for all the information and suggestions. The Hive version is 2.0.1 and we use Hive-on-MR as the execution engine. I think I should create an intermediate table which includes all the dimensions (including the several kinds of ids), and then use spark-sql to calculate the distinct value

Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread Gopal Vijayaraghavan
> COUNT(DISTINCT monthly_user_id) AS monthly_active_users, > COUNT(DISTINCT weekly_user_id) AS weekly_active_users, … > GROUPING_ID() AS gid, > COUNT(1) AS dummy There are two things which prevent Hive from optimizing multiple count distincts. Another aggregate like a count(1) or a Grouping sets li