Funny enough it is pretty close to similar ORC transactional tables I have. Standard with 256 buckets with two columns as below
number of distinct value in column m_d_key : 29 > number of distinct value in column sb_gu_key : 15434343 You have also vectorised data taking 1024 rows at once. Still the optimizer does not tell me much. Also I don't use TEZ. I use Spark as the execution engine. From my experience (and I am sure there will be plenty who will disagree with me :)), the optimiser does not make much difference, it is the execution engine than delivers the performance. The other alternative is that when you populate the table insert the data sorted by m_d_key, sb_gu_key, t_ev_st_dt to ensure that that the optimizer will be better off. Also may help if you add t_ev_st_dt as the third column of the bucket as the LAG() function is using it in ORDER BY CLUSTERED BY (m_d_key, sb_gu_key, t_ev_st_dt) INTO 256 BUCKETS STORED AS ORC TBLPROPERTIES ( "transactional"="true", "orc.create.index"="true", "orc.bloom.filter.columns"="m_d_key, sb_gu_key, t_ev_st_dt", "orc.bloom.filter.fpp"="0.05", "orc.stripe.size"="16777216", "orc.row.index.stride"="10000" ) Others may have better ideas. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 23 June 2016 at 20:30, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote: > Hi Mich , > > Please find below output of command. > > desc formatted tuning_dd_key ; > > > +-------------------------------+-----------------------------------------------------------------------------------------------+-----------------------+--+ > | col_name | > data_type | comment > | > > +-------------------------------+-----------------------------------------------------------------------------------------------+-----------------------+--+ > | # col_name | data_type > | comment > | > | | NULL > | NULL > | > | m_d_key | smallint > | > | > | sb_gu_key | bigint > | > | > | t_ev_st_dt | date > | > | > | ad_zn_key | int > | > | > | c_dt | date > | > | > | e_p_dt | date > | > | > | sq_nbr | int > | > | > | | NULL > | NULL > | > | # Detailed Table Information | NULL > | NULL > | > | Database: | PRDDB > | NULL > | > | CreateTime: | Thu Jun 23 11:03:53 EDT 2016 > | NULL > | > | LastAccessTime: | UNKNOWN > | NULL > | > | Protect Mode: | None > | NULL > | > | Retention: | 0 > | NULL > | > | Table Type: | MANAGED_TABLE > | NULL > | > | Table Parameters: | NULL > | NULL > | > | | COLUMN_STATS_ACCURATE > | true > | > | | numFiles > | 256 > | > | | numRows > | 6357592675 > | > | | rawDataSize > | 0 > | > | | totalSize > | 54076898961 > | > | | transactional > | true > | > | | transient_lastDdlTime > | 1466694970 > | > | | NULL > | NULL > | > | # Storage Information | NULL > | NULL > | > | SerDe Library: | > org.apache.hadoop.hive.ql.io.orc.OrcSerde > | NULL | > | InputFormat: | > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat > | NULL | > | OutputFormat: | > org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat > | NULL | > | Compressed: | No > | NULL > | > | Num Buckets: | 256 > | NULL > | > | Bucket Columns: | [sbsc_guid_key, mas_div_key] > | NULL > | > | Sort Columns: | [] > | NULL > | > | Storage Desc Params: | NULL > | NULL > | > | | serialization.format > | 1 > | > > +-------------------------------+-----------------------------------------------------------------------------------------------+-----------------------+--+ > > > Regards > Sanjiv Singh > Mob : +091 9990-447-339 > > On Thu, Jun 23, 2016 at 12:47 PM, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> Do you also have the output from >> >> desc formatted tuning_dd_key >> >> and send the output please? >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 23 June 2016 at 17:41, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote: >> >>> Hi Gopal, >>> >>> I am using Tez as execution engine. >>> >>> DAG : >>> >>> +--------------------------------------------------------+--+ >>> | >>> Explain >>> | >>> +---------------------------------------------------------+--+ >>> | Plan not optimized by CBO. >>> | >>> | >>> | >>> | Vertex dependency in root stage >>> | >>> | Reducer 2 <- Map 1 (SIMPLE_EDGE) >>> | >>> | >>> | >>> | Stage-0 >>> | >>> | Fetch Operator >>> | >>> | limit:-1 >>> | >>> | Stage-1 >>> | >>> | Reducer 2 >>> | >>> | File Output Operator [FS_55596] >>> | >>> | compressed:false >>> | >>> | Statistics:Num rows: 6357592675 Data size: 54076899328 >>> Basic stats: COMPLETE Column stats: NONE | >>> | >>> table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input >>> format:":"org.apache.hadoop.mapred.TextInputFormat","output >>> format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"} | >>> | Select Operator [SEL_55594] >>> | >>> | >>> >>> outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"] >>> | >>> | Statistics:Num rows: 6357592675 Data size: 54076899328 >>> Basic stats: COMPLETE Column stats: NONE | >>> | PTF Operator [PTF_55593] >>> | >>> | Function definitions:[{"Input >>> definition":{"type:":"WINDOWING"}},{"partition by:":"_col0, >>> _col1","name:":"windowingtablefunction","order by:":"_col2"}] | >>> | Statistics:Num rows: 6357592675 Data size: >>> 54076899328 Basic stats: COMPLETE Column stats: NONE | >>> | Select Operator [SEL_55592] >>> | >>> | | >>> outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6"] >>> | >>> | | Statistics:Num rows: 6357592675 Data size: >>> 54076899328 Basic stats: COMPLETE Column stats: NONE | >>> | |<-Map 1 [SIMPLE_EDGE] vectorized >>> | >>> | Reduce Output Operator [RS_55597] >>> | >>> | key expressions:m_d_key (type: smallint), >>> sb_gu_key (type: bigint), t_ev_st_dt (type: date) | >>> | Map-reduce partition columns:m_d_key (type: >>> smallint), sb_gu_key (type: bigint) | >>> | sort order:+++ >>> | >>> | Statistics:Num rows: 6357592675 Data size: >>> 54076899328 Basic stats: COMPLETE Column stats: NONE >>> | >>> | value expressions:ad_zn_key (type: int), c_dt >>> (type: date), e_p_dt (type: date), sq_nbr (type: int) | >>> | TableScan [TS_55590] >>> | >>> | ACID table:true >>> | >>> | alias:tuning_dd_key >>> | >>> | Statistics:Num rows: 6357592675 Data size: >>> 54076899328 Basic stats: COMPLETE Column stats: NONE | >>> | >>> >>> | >>> >>> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+ >>> >>> Regards >>> Sanjiv Singh >>> Mob : +091 9990-447-339 >>> >>> On Thu, Jun 23, 2016 at 2:45 AM, Gopal Vijayaraghavan <gop...@apache.org >>> > wrote: >>> >>>> >>>> > Long running query : >>>> >>>> Are you running this on MapReduce or Tez? >>>> >>>> Please post the output of explain - if you are seeing > 1 shuffle edge >>>> in >>>> your query while having only one window for OVER(), that might be the >>>> reason. >>>> >>>> OVER ( PARTITION BY m_d_key , sb_gu_key ORDER BY t_ev_st_dt) >>>> >>>> >>>> The multiple PTF operators should have been collapsed by the reduce >>>> sink-deduplication. >>>> >>>> Cheers, >>>> Gopal >>>> >>>> >>>> >>> >> >