There’s a distributed algorithm for windows function that is based on the ORDER BY clause rather than the PARTITION BY clause. I doubt if is implemented in Hive, but it’s worth a shot.
select * ,row_number () over (order by rand()) as ETL_ROW_ID from INTER_ETL ; For unique, not consecutive values you can try this: select * ,cast (floor(r*1000000) + 1 as bigint) + (1000000 * (row_number () over (partition by cast (floor(r*1000000) + 1 as bigint) order by null) - 1)) as ETL_ROW_ID from (select *,rand() as r from INTER_ETL) as t ; If you have in your table a column/combination of columns with unified distribution you can also do something like this: select * , (abs(hash(MyCol1,MyCol2))%1000000 + 1) + (row_number () over (partition by (abs(hash(MyCol1,MyCol2))%1000000 + 1) order by null) - 1) * 1000000L as ETL_ROW_ID from INTER_ETL ; For consecutive values you can do something (ugly…) like this: select * ,a.accum_rows + row_number () over (partition by abs(hash(t.MyCol1,t.MyCol2))%10000 order by null) as ETL_ROW_ID from INTER_ETL as t join (select abs(hash(MyCol1,MyCol2))%10000 as group_id ,sum (count (*)) over (order by MyCol1,MyCol2 rows between unbounded preceding and 1 preceding) - count(*) as accum_rows from INTER_ETL group by abs(hash(MyCol1,MyCol2))%10000 ) as a on a.group_id = abs(hash(t.MyCol1,t.MyCol2))%10000 ; From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com] Sent: Tuesday, June 28, 2016 11:52 PM To: Markovitz, Dudu <dmarkov...@paypal.com> Cc: user@hive.apache.org Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer ETL_ROW_ID is to be consecutive number. I need to check if having unique number would not break any logic. Considering unique number for ETL_ROW_ID column, what are optimum options available? What id it has to be consecutive number only? Regards Sanjiv Singh Mob : +091 9990-447-339 On Tue, Jun 28, 2016 at 4:17 PM, Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote: I’m guessing ETL_ROW_ID should be unique but not necessarily contain only consecutive numbers? From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>] Sent: Tuesday, June 28, 2016 10:57 PM To: Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> Cc: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer Hi Dudu, You are correct ...ROW_NUMBER() is main culprit. ROW_NUMBER() OVER Not Fast Enough With Large Result Set, any good solution? Regards Sanjiv Singh Mob : +091 9990-447-339 On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote: The row_number operation seems to be skewed. Dudu From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>] Sent: Tuesday, June 28, 2016 8:54 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Query Performance Issue : Group By and Distinct and load on reducer Hi All, I am having performance issue with data skew of the distinct statement in Hive<http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>. See below query with DISTINCT operator. Original Query : SELECT DISTINCT SD.REGION ,SD.HEADEND ,SD.NETWORK ,SD.RETAILUNITCODE ,SD.LOGTIMEDATE ,SD.SPOTKEY ,SD.CRE_DT ,CASE WHEN SD.LOGTIMEDATE IS NULL THEN 'Y' ELSE 'N' END AS DROP_REASON ,ROW_NUMBER() OVER ( ORDER BY NULL ) AS ETL_ROW_ID FROM INTER_ETL AS SD; Table INTER_ETL used for query is big enough. From the logs , it seems that data skew for specific set of values , causing one of reducer have to do all the job. I tried to achieve the same through GROUP BY still having the same issue. Help me to understand the issue and resolution. Query with Distinct V2 : CREATE TEMPORARY TABLE ETL_TMP AS SELECT DISTINCT dt.* FROM ( SELECT SD.REGION ,SD.HEADEND ,SD.NETWORK ,SD.RETAILUNITCODE ,SD.LOGTIMEDATE ,SD.SPOTKEY ,SD.CRE_DT ,CASE WHEN SD.LOGTIMEDATE IS NULL THEN 'Y' ELSE 'N' END AS DROP_REASON ,ROW_NUMBER() OVER ( ORDER BY NULL ) AS ETL_ROW_ID FROM INTER_ETL AS SD ) AS dt; Logs: INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 Query With Group By: CREATE TEMPORARY TABLE ETL_TMP AS SELECT REGION ,HEADEND ,NETWORK ,RETAILUNITCODE ,LOGTIMEDATE ,SPOTKEY ,CRE_DT ,DROP_REASON ,ETL_ROW_ID FROM ( SELECT SD.REGION ,SD.HEADEND ,SD.NETWORK ,SD.RETAILUNITCODE ,SD.LOGTIMEDATE ,SD.SPOTKEY ,SD.CRE_DT ,CASE WHEN SD.LOGTIMEDATE IS NULL THEN 'Y' ELSE 'N' END AS DROP_REASON ,ROW_NUMBER() OVER ( ORDER BY NULL ) AS ETL_ROW_ID FROM INTER_ETL AS SD ) AS dt GROUP BY REGION ,HEADEND ,NETWORK ,RETAILUNITCODE ,LOGTIMEDATE ,SPOTKEY ,CRE_DT ,DROP_REASON ,ETL_ROW_ID; Logs: INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 Table details : Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ; +--------------------------------------------------------------------------------------------------------------------------------------------+--+ | DFS Output | +--------------------------------------------------------------------------------------------------------------------------------------------+--+ | Found 15 items | | -rwxrwxrwx 3 Z56034 hdfs 2075244899 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0 | | -rwxrwxrwx 3 Z56034 hdfs 2090030620 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0 | | -rwxrwxrwx 3 Z56034 hdfs 2025516774 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0 | | -rwxrwxrwx 3 Z56034 hdfs 1986848213 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0 | | -rwxrwxrwx 3 Z56034 hdfs 2018883723 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0 | | -rwxrwxrwx 3 Z56034 hdfs 1984690335 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0 | | -rwxrwxrwx 3 Z56034 hdfs 1987494444 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0 | | -rwxrwxrwx 3 Z56034 hdfs 1974674515 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0 | | -rwxrwxrwx 3 Z56034 hdfs 1963720218 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0 | | -rwxrwxrwx 3 Z56034 hdfs 1965892384 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0 | | -rwxrwxrwx 3 Z56034 hdfs 1974272622 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0 | | -rwxrwxrwx 3 Z56034 hdfs 1971948208 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0 | | -rwxrwxrwx 3 Z56034 hdfs 1968141886 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0 | | -rwxrwxrwx 3 Z56034 hdfs 1970930771 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0 | | -rwxrwxrwx 3 Z56034 hdfs 192820628 2016-06-28 10:23 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0 | +--------------------------------------------------------------------------------------------------------------------------------------------+--+ 'numFiles'='15', 'numRows'='108363614', Regards Sanjiv Singh Mob : +091 9990-447-339