Hi Dudu, I tried the same queries on the same table, which has 6357592675 rows. See the responses for all three below.
*I tried the 1st one; it is giving duplicate row numbers.*

> CREATE TEMPORARY TABLE INTER_ETL_T AS
> select *
>       ,cast (floor(r*1000000) + 1 as bigint) + (1000000 * (row_number () over
>        (partition by cast (floor(r*1000000) + 1 as bigint) order by null) - 1)) as ROW_NUM
> from (select *,rand() as r from INTER_ETL) as t ;

> select ROW_NUM, count(*) from INTER_ETL_T group by ROW_NUM having count(*) > 1 limit 10;

+--------------+------+--+
|   ROW_NUM    | _c1  |
+--------------+------+--+
| -2146932303  | 2    |
| -2146924922  | 2    |
| -2146922710  | 2    |
| -2146901450  | 2    |
| -2146897115  | 2    |
| -2146874805  | 2    |
| -2146869449  | 2    |
| -2146865918  | 2    |
| -2146864595  | 2    |
| -2146857688  | 2    |
+--------------+------+--+

On the 2nd one, it is not giving any duplicates and was much faster than ROW_NUMBER(), at least.
numRows=6357592675, totalSize=405516934422, rawDataSize=399159341747

*And the 3rd one, for consecutive numbers, is not compatible with Hive.*

> CREATE TEMPORARY TABLE INTER_ETL_T AS
> select *
>       ,a.accum_rows + row_number () over (partition by
>        abs(hash(t.m_d_key,t.s_g_key))%10000 order by null) as ROW_NUM
> from INTER_ETL as t
> join (select abs(hash(m_d_key,s_g_key))%10000 as group_id
>             ,sum (count (*)) over (order by m_d_key,s_g_key rows between unbounded
>              preceding and 1 preceding) - count(*) as accum_rows
>       from INTER_ETL
>       group by abs(hash(m_d_key,s_g_key))%10000
>      ) as a
> on a.group_id = abs(hash(t.m_d_key,t.s_g_key))%10000
> ;

Error :
Error: Error while compiling statement: FAILED: SemanticException End of a
WindowFrame cannot be UNBOUNDED PRECEDING (state=42000,code=40000)
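Two untested sketches related to the results above; both reuse the column names (m_d_key, s_g_key) and bucket counts already used in this thread.

The duplicate ROW_NUM values from the 1st query (all just above -2147483648) look like 32-bit overflow: with roughly 6.36 billion rows spread over 1 million random buckets, row_number() within a bucket goes well past 2147, so 1000000 * (row_number() - 1) exceeds 2^31 if that product is evaluated as int before being added to the bigint, and the wrapped values can collide. Forcing the multiplier to bigint (as the hash-based variant already does with 1000000L) should avoid that:

-- Sketch (untested): 1st query with a bigint multiplier, so the
-- bucket-offset arithmetic cannot wrap around and collide.
CREATE TEMPORARY TABLE INTER_ETL_T AS
select *
      ,cast(floor(r*1000000) + 1 as bigint)
       + (1000000L * (row_number() over
            (partition by cast(floor(r*1000000) + 1 as bigint) order by null) - 1)) as ROW_NUM
from (select *, rand() as r from INTER_ETL) as t;

For the 3rd (consecutive) query, Hive rejects a frame that ends at "1 preceding". Assuming accum_rows is meant to be the total number of rows in all preceding groups, the same value can be computed with a frame ending at the current row, subtracting the current group's count, and ordering by the group expression (the raw key columns are not in the GROUP BY). Only the derived table changes; the outer join and row_number() stay as in the original:

-- Sketch (untested): accum_rows = number of rows in all groups before this one,
-- using a window frame that Hive accepts.
select abs(hash(m_d_key,s_g_key))%10000 as group_id
      ,sum(count(*)) over (order by abs(hash(m_d_key,s_g_key))%10000
                           rows between unbounded preceding and current row)
       - count(*) as accum_rows
from INTER_ETL
group by abs(hash(m_d_key,s_g_key))%10000;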
Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Tue, Jun 28, 2016 at 6:16 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:

> thanks a lot.
> let me give it a try.
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Jun 28, 2016 at 5:32 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>
>> There's a distributed algorithm for window functions that is based on the
>> ORDER BY clause rather than the PARTITION BY clause.
>> I doubt it is implemented in Hive, but it's worth a shot.
>>
>> select *
>>       ,row_number () over (order by rand()) as ETL_ROW_ID
>> from INTER_ETL
>> ;
>>
>> For unique, not consecutive values you can try this:
>>
>> select *
>>       ,cast (floor(r*1000000) + 1 as bigint) + (1000000 * (row_number () over
>>        (partition by cast (floor(r*1000000) + 1 as bigint) order by null) - 1)) as ETL_ROW_ID
>> from (select *,rand() as r from INTER_ETL) as t
>> ;
>>
>> If you have in your table a column/combination of columns with a uniform
>> distribution you can also do something like this:
>>
>> select *
>>       ,(abs(hash(MyCol1,MyCol2))%1000000 + 1) + (row_number () over
>>        (partition by (abs(hash(MyCol1,MyCol2))%1000000 + 1) order by null) - 1) * 1000000L as ETL_ROW_ID
>> from INTER_ETL
>> ;
>>
>> For consecutive values you can do something (ugly…) like this:
>>
>> select *
>>       ,a.accum_rows + row_number () over (partition by
>>        abs(hash(t.MyCol1,t.MyCol2))%10000 order by null) as ETL_ROW_ID
>> from INTER_ETL as t
>> join (select abs(hash(MyCol1,MyCol2))%10000 as group_id
>>             ,sum (count (*)) over (order by MyCol1,MyCol2 rows between unbounded
>>              preceding and 1 preceding) - count(*) as accum_rows
>>       from INTER_ETL
>>       group by abs(hash(MyCol1,MyCol2))%10000
>>      ) as a
>> on a.group_id = abs(hash(t.MyCol1,t.MyCol2))%10000
>> ;
>>
>> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
>> *Sent:* Tuesday, June 28, 2016 11:52 PM
>> *To:* Markovitz, Dudu <dmarkov...@paypal.com>
>> *Cc:* user@hive.apache.org
>> *Subject:* Re: Query Performance Issue : Group By and Distinct and load on reducer
>>
>> ETL_ROW_ID is to be a consecutive number. I need to check that having a unique
>> number would not break any logic.
>>
>> Considering a unique number for the ETL_ROW_ID column, what are the optimum
>> options available?
>> What if it has to be a consecutive number only?
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Tue, Jun 28, 2016 at 4:17 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> I'm guessing ETL_ROW_ID should be unique but not necessarily contain
>> only consecutive numbers?
>>
>> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
>> *Sent:* Tuesday, June 28, 2016 10:57 PM
>> *To:* Markovitz, Dudu <dmarkov...@paypal.com>
>> *Cc:* user@hive.apache.org
>> *Subject:* Re: Query Performance Issue : Group By and Distinct and load on reducer
>>
>> Hi Dudu,
>>
>> You are correct ... ROW_NUMBER() is the main culprit.
>>
>> ROW_NUMBER() OVER Not Fast Enough With Large Result Set, any good solution?
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> The row_number operation seems to be skewed.
>>
>> Dudu
>>
>> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
>> *Sent:* Tuesday, June 28, 2016 8:54 PM
>> *To:* user@hive.apache.org
>> *Subject:* Query Performance Issue : Group By and Distinct and load on reducer
>>
>> Hi All,
>>
>> I am having a performance issue with data skew of the distinct statement
>> in Hive
>> <http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
>> See the query with the DISTINCT operator below.
>> *Original Query :*
>>
>> SELECT DISTINCT
>>        SD.REGION
>>       ,SD.HEADEND
>>       ,SD.NETWORK
>>       ,SD.RETAILUNITCODE
>>       ,SD.LOGTIMEDATE
>>       ,SD.SPOTKEY
>>       ,SD.CRE_DT
>>       ,CASE
>>            WHEN SD.LOGTIMEDATE IS NULL
>>                THEN 'Y'
>>            ELSE 'N'
>>        END AS DROP_REASON
>>       ,ROW_NUMBER() OVER (
>>            ORDER BY NULL
>>        ) AS ETL_ROW_ID
>> FROM INTER_ETL AS SD;
>>
>> The table *INTER_ETL* used in the query is quite big.
>> From the logs, it seems there is data skew for a specific set of values,
>> causing one of the reducers to do all the work. I tried to achieve the same
>> through GROUP BY but am still having the same issue. Please help me understand
>> the issue and its resolution.
>>
>> *Query with Distinct V2 :*
>>
>> CREATE TEMPORARY TABLE ETL_TMP AS
>> SELECT DISTINCT dt.*
>> FROM (
>>     SELECT SD.REGION
>>           ,SD.HEADEND
>>           ,SD.NETWORK
>>           ,SD.RETAILUNITCODE
>>           ,SD.LOGTIMEDATE
>>           ,SD.SPOTKEY
>>           ,SD.CRE_DT
>>           ,CASE
>>                WHEN SD.LOGTIMEDATE IS NULL
>>                    THEN 'Y'
>>                ELSE 'N'
>>            END AS DROP_REASON
>>           ,ROW_NUMBER() OVER (
>>                ORDER BY NULL
>>            ) AS ETL_ROW_ID
>>     FROM INTER_ETL AS SD
>>     ) AS dt;
>>
>> Logs:
>>
>> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
>> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
>> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
>> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
>> INFO : Map 1: 107/107 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418
>>
>> *Query With Group By:*
>>
>> CREATE TEMPORARY TABLE ETL_TMP AS
>> SELECT REGION
>>       ,HEADEND
>>       ,NETWORK
>>       ,RETAILUNITCODE
>>       ,LOGTIMEDATE
>>       ,SPOTKEY
>>       ,CRE_DT
>>       ,DROP_REASON
>>       ,ETL_ROW_ID
>> FROM (
>>     SELECT SD.REGION
>>           ,SD.HEADEND
>>           ,SD.NETWORK
>>           ,SD.RETAILUNITCODE
>>           ,SD.LOGTIMEDATE
>>           ,SD.SPOTKEY
>>           ,SD.CRE_DT
>>           ,CASE
>>                WHEN SD.LOGTIMEDATE IS NULL
>>                    THEN 'Y'
>>                ELSE 'N'
>>            END AS DROP_REASON
>>           ,ROW_NUMBER() OVER (
>>                ORDER BY NULL
>>            ) AS ETL_ROW_ID
>>     FROM INTER_ETL AS SD
>>     ) AS dt
>> GROUP BY
>>        REGION
>>       ,HEADEND
>>       ,NETWORK
>>       ,RETAILUNITCODE
>>       ,LOGTIMEDATE
>>       ,SPOTKEY
>>       ,CRE_DT
>>       ,DROP_REASON
>>       ,ETL_ROW_ID;
>>
>> Logs:
>>
>> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
>> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
>> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
>> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
>> INFO : Map 1: 818/818 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418
>>
>> *Table details :*
>>
>> Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ;
>>
>> +--------------------------------------------------------------------------------------------------------------------------------------------+--+
>> | DFS Output |
>> +--------------------------------------------------------------------------------------------------------------------------------------------+--+
>> | Found 15 items |
>> | -rwxrwxrwx   3 Z56034 hdfs 2075244899 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 2090030620 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 2025516774 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1986848213 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 2018883723 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1984690335 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1987494444 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1974674515 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1963720218 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1965892384 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1974272622 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1971948208 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1968141886 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs 1970930771 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0 |
>> | -rwxrwxrwx   3 Z56034 hdfs  192820628 2016-06-28 10:23 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0 |
>> +--------------------------------------------------------------------------------------------------------------------------------------------+--+
>>
>> 'numFiles'='15',
>> 'numRows'='108363614',
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339