---------- Forwarded message ----------
From: panfei <cnwe...@gmail.com>
Date: 2017-08-23 12:26 GMT+08:00
Subject: How to optimize multiple count( distinct col) in Hive SQL
To: CDH Users <cdh-u...@cloudera.org>


SELECT product_id AS product_id,
       channel_id AS channel_id,
       server_id AS server_id,
       platform AS platform,
       COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
       COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
       COUNT(DISTINCT daily_user_id) AS daily_active_users,
       COUNT(DISTINCT monthly_device_id) AS monthly_active_devices,
       COUNT(DISTINCT weekly_device_id) AS weekly_active_devices,
       COUNT(DISTINCT daily_device_id) AS daily_active_devices,
       COUNT(DISTINCT monthly_ip) AS monthly_active_ips,
       COUNT(DISTINCT weekly_ip) AS weekly_active_ips,
       COUNT(DISTINCT daily_ip) AS daily_active_ips,
       COUNT(DISTINCT monthly_role_id) AS monthly_active_roles,
       COUNT(DISTINCT weekly_role_id) AS weekly_active_roles,
       COUNT(DISTINCT daily_role_id) AS daily_active_roles,
       GROUPING__ID AS gid,
       COUNT(1) AS dummy
FROM (.........)
GROUP BY .....



There are many COUNT(DISTINCT xxx) expressions in the SELECT clause. When
this SQL is executed on a very large dataset, it fails with an OOM "Java
Heap Space" error.

I suspect it is the COUNT(DISTINCT xxx) expressions that occupy a lot of
heap memory for deduplication. Am I right? And is there a way to rewrite
the query to reduce its memory usage?
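
One rewrite I am considering (a sketch only: source_table stands for the
elided subquery, and just one of the twelve distinct columns is shown) is
to deduplicate with an inner GROUP BY first, so the distinct values are
spread across many reducers instead of being held in one reducer's
in-memory set:

-- pre-deduplicate, then count plain rows instead of COUNT(DISTINCT)
SELECT product_id,
       channel_id,
       server_id,
       platform,
       COUNT(daily_user_id) AS daily_active_users
FROM (
    -- the inner GROUP BY makes each (group cols, daily_user_id) row unique
    SELECT product_id, channel_id, server_id, platform, daily_user_id
    FROM source_table
    GROUP BY product_id, channel_id, server_id, platform, daily_user_id
) deduped
GROUP BY product_id, channel_id, server_id, platform;

With twelve distinct columns this would need one such subquery per column
(joined back together, or combined via UNION ALL with a tag column), so I
am not sure it is practical here.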

Every worker node has 32GB of physical memory in total. Some related settings:

hive.optimize.distinct.rewrite=true;
mapreduce.map.memory.mb=1638;
mapreduce.reduce.memory.mb=3276;
yarn.nodemanager.resource.memory-mb=26208;
mapreduce.reduce.shuffle.memory.limit.percent=0.25;
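
In case it is mainly a heap-size issue, I could also try raising the
reducer container and heap sizes for this session (the values below are
only guesses, not tested):

-- per-session overrides; heap (-Xmx) kept at roughly 80% of the container
SET mapreduce.reduce.memory.mb=8192;
SET mapreduce.reduce.java.opts=-Xmx6554m;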


thanks



-- 
If you don't study, you don't know.


