The full error stack (also described here: https://issues.apache.org/jira/browse/MAPREDUCE-6108) is:
This error does not reproduce every time; after retrying several times, the job finishes successfully.

2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
        at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task

2017-08-23 13:10 GMT+08:00 panfei <cnwe...@gmail.com>:

> Hi Gopal, thanks for all the information and suggestions.
>
> The Hive version is 2.0.1 and we use Hive-on-MR as the execution engine.
>
> I think I should create an intermediate table which includes all the
> dimensions (including the several kinds of ids), and then use Spark SQL
> to calculate the distinct values separately (Spark SQL is really fast,
> so ~~).
>
> Thanks again.
>
> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gop...@apache.org>:
>
>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>> …
>> > GROUPING_ID() AS gid,
>> > COUNT(1) AS dummy
>>
>> There are two things which prevent Hive from optimizing multiple count
>> distincts: another aggregate like a COUNT(1), or grouping sets like a
>> ROLLUP/CUBE.
>>
>> The multiple count distincts are rewritten into a ROLLUP internally by
>> the CBO:
>>
>> https://issues.apache.org/jira/browse/HIVE-10901
>>
>> A single count distinct + other aggregates (like min, max, count,
>> count distinct in one pass) is fixed via
>>
>> https://issues.apache.org/jira/browse/HIVE-16654
>>
>> There's no optimizer rule to combine both of those scenarios:
>>
>> https://issues.apache.org/jira/browse/HIVE-15045
>>
>> There's a possibility that you're using a Hive-1.x release branch, where
>> the CBO doesn't kick in unless column stats are present; in the Hive-2.x
>> series you'll notice that some of these optimizations are not driven by
>> a cost function and are always applied if CBO is enabled.
>>
>> > is there any way to rewrite it to optimize the memory usage?
>>
>> If you want it to run through very slowly without errors, you can try
>> disabling all in-memory aggregations:
>>
>> set hive.map.aggr=false;
>>
>> Cheers,
>> Gopal
>
> --
> 不学习,不知道 (If you don't learn, you don't know.)

--
不学习,不知道 (If you don't learn, you don't know.)
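
The OutOfMemoryError in the stack above is raised while the fetcher reserves in-memory buffers for fetched map output (MergeManagerImpl.reserve), so one common workaround discussed on MAPREDUCE-6108 is to shrink the reducer's in-memory shuffle buffers so more segments spill to disk. A minimal sketch of the relevant knobs, set from the Hive session; the values below are illustrative, not tuned recommendations:

    -- Fraction of reducer heap used to buffer fetched map output
    -- (Hadoop default is 0.70); lowering it trades speed for headroom.
    set mapreduce.reduce.shuffle.input.buffer.percent=0.3;
    -- Per-fetch in-memory limit as a fraction of that buffer
    -- (default 0.25); segments above it go straight to disk.
    set mapreduce.reduce.shuffle.memory.limit.percent=0.15;
    -- Fewer parallel fetchers also lowers peak shuffle memory (default 5).
    set mapreduce.reduce.shuffle.parallelcopies=4;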
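Until HIVE-15045 lands, one way to hand-apply the split Gopal describes is to give each COUNT(DISTINCT ...) its own single-distinct subquery (the shape HIVE-16654 can optimize) and join the results. A minimal sketch, assuming a hypothetical source table t with the user-id columns from the quoted query, and dropping the grouping-sets part for brevity:

    -- Each subquery has exactly one COUNT(DISTINCT), so each can run
    -- without the multi-distinct memory blowup; t is a placeholder name.
    SELECT m.monthly_active_users, w.weekly_active_users
    FROM (SELECT COUNT(DISTINCT monthly_user_id) AS monthly_active_users
          FROM t) m
    CROSS JOIN
         (SELECT COUNT(DISTINCT weekly_user_id) AS weekly_active_users
          FROM t) w;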
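The intermediate-table plan from panfei's reply might look roughly like the sketch below; table and column names (user_dims, source_events, dt) are placeholders, not the poster's actual schema:

    -- Materialize just the dimensions and id columns once.
    CREATE TABLE user_dims AS
    SELECT dt, monthly_user_id, weekly_user_id
    FROM source_events;

    -- Then run one single-distinct query per id column, e.g. in Spark SQL.
    SELECT dt, COUNT(DISTINCT monthly_user_id) AS monthly_active_users
    FROM user_dims
    GROUP BY dt;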