The full error stack (also described here: https://issues.apache.org/jira/browse/MAPREDUCE-6108) is:
This error does not reproduce every time; after retrying several times, the job finishes successfully.

2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
        at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task

2017-08-23 13:10 GMT+08:00 panfei <cnwe...@gmail.com>:

> Hi Gopal, thanks for all the information and suggestions.
>
> The Hive version is 2.0.1 and we use Hive-on-MR as the execution engine.
>
> I think I should create an intermediate table which includes all the
> dimensions (including the several kinds of ids), and then use Spark SQL
> to calculate the distinct values separately (Spark SQL is really fast,
> so ~~).
>
> Thanks again.
>
> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gop...@apache.org>:
>
>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>> …
>> > GROUPING_ID() AS gid,
>> > COUNT(1) AS dummy
>>
>> There are two things which prevent Hive from optimizing multiple count
>> distincts: another aggregate like a COUNT(1), or grouping sets like a
>> ROLLUP/CUBE.
>>
>> The multiple count distincts are rewritten into a ROLLUP internally by
>> the CBO:
>>
>> https://issues.apache.org/jira/browse/HIVE-10901
>>
>> A single count distinct + other aggregates (like min, max, count,
>> count distinct in one pass) is fixed via
>>
>> https://issues.apache.org/jira/browse/HIVE-16654
>>
>> There's no optimizer rule to combine both of those scenarios:
>>
>> https://issues.apache.org/jira/browse/HIVE-15045
>>
>> There's a possibility that you're using a Hive-1.x release branch, where
>> the CBO doesn't kick in unless column stats are present; in the Hive-2.x
>> series you'll notice that some of these optimizations are not driven by
>> a cost function and are always applied if CBO is enabled.
>>
>> > is there any way to rewrite it to optimize the memory usage?
>>
>> If you want it to run through very slowly without errors, you can try
>> disabling all in-memory aggregations:
>>
>> set hive.map.aggr=false;
>>
>> Cheers,
>> Gopal
>
> --
> 不学习,不知道 (If you don't learn, you don't know.)

--
不学习,不知道 (If you don't learn, you don't know.)
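
The OutOfMemoryError in the stack above is raised while the fetcher reserves in-memory buffers for fetched map output (MergeManagerImpl.reserve), so one common workaround discussed on MAPREDUCE-6108 is to shrink the reducer's in-memory shuffle buffers so more segments spill to disk. A minimal sketch of the relevant knobs, set from the Hive session; the values below are illustrative, not tuned recommendations:

    -- Fraction of reducer heap used to buffer fetched map output
    -- (Hadoop default is 0.70); lowering it trades speed for headroom.
    set mapreduce.reduce.shuffle.input.buffer.percent=0.3;
    -- Per-fetch in-memory limit as a fraction of that buffer
    -- (default 0.25); segments above it go straight to disk.
    set mapreduce.reduce.shuffle.memory.limit.percent=0.15;
    -- Fewer parallel fetchers also lowers peak shuffle memory (default 5).
    set mapreduce.reduce.shuffle.parallelcopies=4;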
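Until HIVE-15045 lands, one way to hand-apply the split Gopal describes is to give each COUNT(DISTINCT ...) its own single-distinct subquery (the shape HIVE-16654 can optimize) and join the results. A minimal sketch, assuming a hypothetical source table t with the user-id columns from the quoted query, and dropping the grouping-sets part for brevity:

    -- Each subquery has exactly one COUNT(DISTINCT), so each can run
    -- without the multi-distinct memory blowup; t is a placeholder name.
    SELECT m.monthly_active_users, w.weekly_active_users
    FROM (SELECT COUNT(DISTINCT monthly_user_id) AS monthly_active_users
          FROM t) m
    CROSS JOIN
         (SELECT COUNT(DISTINCT weekly_user_id) AS weekly_active_users
          FROM t) w;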
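The intermediate-table plan from panfei's reply might look roughly like the sketch below; table and column names (user_dims, source_events, dt) are placeholders, not the poster's actual schema:

    -- Materialize just the dimensions and id columns once.
    CREATE TABLE user_dims AS
    SELECT dt, monthly_user_id, weekly_user_id
    FROM source_events;

    -- Then run one single-distinct query per id column, e.g. in Spark SQL.
    SELECT dt, COUNT(DISTINCT monthly_user_id) AS monthly_active_users
    FROM user_dims
    GROUP BY dt;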