As the amount of data got larger and larger, the same OOM occurred
again, so we also lowered hive.exec.reducers.bytes.per.reducer from
256MB to 64MB, and everything has gone well since ~ so the root cause
of the issue is that a single reducer cannot process that much data in
one round. Hope it helps.
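In the end the relevant session-level settings look roughly like this
(a minimal sketch; the values fit our data volume and will likely need
tuning for yours, and they could equally go in mapred-site.xml /
hive-site.xml):

    -- fetch fewer map outputs in parallel, so less shuffle data is
    -- buffered in reducer memory at once (was 20)
    set mapreduce.reduce.shuffle.parallelcopies=5;

    -- the property takes a byte count; ~64MB, down from our earlier 256MB
    set hive.exec.reducers.bytes.per.reducer=64000000;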
2017-08-24 9:42 GMT+08:00 panfei <cnwe...@gmail.com>:

> By decreasing mapreduce.reduce.shuffle.parallelcopies from 20 to 5, it
> seems that everything goes well, no OOM ~~
>
> 2017-08-23 17:19 GMT+08:00 panfei <cnwe...@gmail.com>:
>
>> The full error stack is (as described here:
>> https://issues.apache.org/jira/browse/MAPREDUCE-6108):
>>
>> This error does not reproduce every time; after several retries, the
>> job finished successfully.
>>
>> 2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child :
>> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
>> at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
>> at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
>> at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
>> at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
>> at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
>> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
>> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
>> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>>
>> 2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>>
>> 2017-08-23 13:10 GMT+08:00 panfei <cnwe...@gmail.com>:
>>
>>> Hi Gopal, thanks for all the information and suggestions.
>>>
>>> The Hive version is 2.0.1, with Hive-on-MR as the execution engine.
>>>
>>> I think I should create an intermediate table which includes all the
>>> dimensions (including the several kinds of ids), and then use Spark
>>> SQL to calculate the distinct values separately (Spark SQL is really
>>> fast, so ~~).
>>>
>>> Thanks again.
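[A minimal sketch of that split, untested; src_events, tmp_user_ids
and the dim columns are placeholder names, while monthly_user_id /
weekly_user_id come from the query quoted below:]

    -- 1. Materialize the dimensions plus the id columns once.
    CREATE TABLE tmp_user_ids AS
    SELECT dim1, dim2, monthly_user_id, weekly_user_id
    FROM src_events;

    -- 2. Run each distinct count as its own single-distinct query
    --    (here or in Spark SQL), so each one can be planned on its own.
    SELECT dim1, dim2,
           COUNT(DISTINCT monthly_user_id) AS monthly_active_users
    FROM tmp_user_ids
    GROUP BY dim1, dim2;

    SELECT dim1, dim2,
           COUNT(DISTINCT weekly_user_id) AS weekly_active_users
    FROM tmp_user_ids
    GROUP BY dim1, dim2;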
>>> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gop...@apache.org>:
>>>
>>>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>>>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>>>> …
>>>> > GROUPING_ID() AS gid,
>>>> > COUNT(1) AS dummy
>>>>
>>>> There are two things which prevent Hive from optimizing multiple
>>>> count distincts: another aggregate like a count(1), or grouping
>>>> sets like a ROLLUP/CUBE.
>>>>
>>>> The multiple count distincts are rewritten into a ROLLUP internally
>>>> by the CBO.
>>>>
>>>> https://issues.apache.org/jira/browse/HIVE-10901
>>>>
>>>> A single count distinct + other aggregates (like min, max, count,
>>>> count_distinct in 1 pass) is fixed via
>>>>
>>>> https://issues.apache.org/jira/browse/HIVE-16654
>>>>
>>>> There's no optimizer rule to combine both those scenarios.
>>>>
>>>> https://issues.apache.org/jira/browse/HIVE-15045
>>>>
>>>> There's a possibility that you're using the Hive-1.x release
>>>> branch, where the CBO doesn't kick in unless column stats are
>>>> present; but in the Hive-2.x series you'll notice that some of
>>>> these optimizations are not driven by a cost function and are
>>>> always applied if CBO is enabled.
>>>>
>>>> > is there any way to rewrite it to optimize the memory usage.
>>>>
>>>> If you want it to run through very slowly without errors, you can
>>>> try disabling all in-memory aggregations:
>>>>
>>>> set hive.map.aggr=false;
>>>>
>>>> Cheers,
>>>> Gopal

--
If you don't learn, you don't know.
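[For context on the ROLLUP rewrite Gopal mentions: as I read
HIVE-10901, a multi-distinct query is expanded into a single pass over
grouping sets, conceptually something like the untested sketch below,
where t, a and b are placeholder names:]

    -- Original shape: two distincts in one query.
    --   SELECT COUNT(DISTINCT a), COUNT(DISTINCT b) FROM t;

    -- Conceptual rewrite: one grouping-sets pass, then count the
    -- non-NULL grouping keys (COUNT skips NULLs, as COUNT(DISTINCT) does).
    SELECT COUNT(a) AS distinct_a,
           COUNT(b) AS distinct_b
    FROM (
      SELECT a, b
      FROM t
      GROUP BY a, b GROUPING SETS ((a), (b))
    ) g;

Because the rewrite consumes the grouping-sets machinery itself, a
query that already uses GROUPING_ID()/ROLLUP or carries an extra
aggregate like COUNT(1) cannot be transformed this way, which is the
missing combination Gopal links above (HIVE-15045).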