ETL_ROW_ID is to be consecutive number. I need to check if having unique number would not break any logic.
Considering unique number for ETL_ROW_ID column, what are optimum options available? What id it has to be consecutive number only? Regards Sanjiv Singh Mob : +091 9990-447-339 On Tue, Jun 28, 2016 at 4:17 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote: > I’m guessing ETL_ROW_ID should be unique but not necessarily contain only > consecutive numbers? > > > > *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com] > *Sent:* Tuesday, June 28, 2016 10:57 PM > *To:* Markovitz, Dudu <dmarkov...@paypal.com> > *Cc:* user@hive.apache.org > *Subject:* Re: Query Performance Issue : Group By and Distinct and load > on reducer > > > > Hi Dudu, > > > > You are correct ...ROW_NUMBER() is main culprit. > > > > ROW_NUMBER() OVER Not Fast Enough With Large Result Set, any good solution? > > > > > > > Regards > Sanjiv Singh > Mob : +091 9990-447-339 > > > > On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu <dmarkov...@paypal.com> > wrote: > > The row_number operation seems to be skewed. > > > > Dudu > > > > *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com] > *Sent:* Tuesday, June 28, 2016 8:54 PM > *To:* user@hive.apache.org > *Subject:* Query Performance Issue : Group By and Distinct and load on > reducer > > > > Hi All, > > > I am having performance issue with data skew of the distinct statement in > Hive > <http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>. > See below query with DISTINCT operator. > > *Original Query : * > > > > SELECT DISTINCT > > SD.REGION > > ,SD.HEADEND > ,SD.NETWORK > ,SD.RETAILUNITCODE > ,SD.LOGTIMEDATE > ,SD.SPOTKEY > ,SD.CRE_DT > ,CASE > WHEN SD.LOGTIMEDATE IS NULL > THEN 'Y' > ELSE 'N' > END AS DROP_REASON > ,ROW_NUMBER() OVER ( > ORDER BY NULL > ) AS ETL_ROW_ID > FROM INTER_ETL AS SD; > > > > Table *INTER_ETL *used for query is big enough. > From the logs , it seems that data skew for specific set of values , > causing one of reducer have to do all the job. I tried to achieve the same > through GROUP BY still having the same issue. Help me to understand the > issue and resolution. > > *Query with Distinct V2 :* > > > > CREATE TEMPORARY TABLE ETL_TMP AS > SELECT DISTINCT dt.* > FROM ( > SELECT SD.REGION > ,SD.HEADEND > ,SD.NETWORK > ,SD.RETAILUNITCODE > ,SD.LOGTIMEDATE > ,SD.SPOTKEY > ,SD.CRE_DT > ,CASE > WHEN SD.LOGTIMEDATE IS NULL > THEN 'Y' > ELSE 'N' > END AS DROP_REASON > ,ROW_NUMBER() OVER ( > ORDER BY NULL > ) AS ETL_ROW_ID > FROM INTER_ETL AS SD > ) AS dt; > > > > Logs: > > > > INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 > INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 > INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 > INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418 > INFO : Map 1: 107/107 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418 > > > > > > *Query With Group By:* > > > > CREATE TEMPORARY TABLE ETL_TMP AS > SELECT REGION > ,HEADEND > ,NETWORK > ,RETAILUNITCODE > ,LOGTIMEDATE > ,SPOTKEY > ,CRE_DT > ,DROP_REASON > ,ETL_ROW_ID > FROM ( > SELECT SD.REGION > ,SD.HEADEND > ,SD.NETWORK > ,SD.RETAILUNITCODE > ,SD.LOGTIMEDATE > ,SD.SPOTKEY > ,SD.CRE_DT > ,CASE > WHEN SD.LOGTIMEDATE IS NULL > THEN 'Y' > ELSE 'N' > END AS DROP_REASON > ,ROW_NUMBER() OVER ( > ORDER BY NULL > ) AS ETL_ROW_ID > FROM INTER_ETL AS SD > ) AS dt > GROUP BY > REGION > ,HEADEND > ,NETWORK > ,RETAILUNITCODE > ,LOGTIMEDATE > ,SPOTKEY > ,CRE_DT > ,DROP_REASON > ,ETL_ROW_ID; > > > > Logs: > > > > INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 > INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 > INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 > INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418 > INFO : Map 1: 818/818 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418 > > > > *Table details :* > > > > Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ; > > > +--------------------------------------------------------------------------------------------------------------------------------------------+--+ > > | DFS > Output | > > > +--------------------------------------------------------------------------------------------------------------------------------------------+--+ > > | Found 15 items > | > > | -rwxrwxrwx 3 Z56034 hdfs 2075244899 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 2090030620 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 2025516774 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1986848213 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 2018883723 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1984690335 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1987494444 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1974674515 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1963720218 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1965892384 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1974272622 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1971948208 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1968141886 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 1970930771 2016-06-28 10:24 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0 | > > | -rwxrwxrwx 3 Z56034 hdfs 192820628 2016-06-28 10:23 > /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0 | > > > +--------------------------------------------------------------------------------------------------------------------------------------------+--+ > > 'numFiles'='15', > > > 'numRows'='108363614', > > > > > > > > Regards > Sanjiv Singh > Mob : +091 9990-447-339 > > >