Re: Optimize Hive Query

Mich Talebzadeh Fri, 24 Jun 2016 15:07:14 -0700

Hi Sanjiv,

Normally when it comes to this, I will try to find the section of the code
which cause the largest lag


SELECT
> sb_gu_key, m_d_key, t_ev_st_dt,
> LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY
>  t_ev_st_dt ) AS LAG_START_DT,
> a_z_key,
> c_dt,
> e_p_dt,
> sq_nbr,
> CASE WHEN LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key
>  ORDER BY  t_ev_st_dt ) IS NULL  OR a_z_key <> LAG( a_z_key , 1 , -999 )
>  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt )  THEN 'S'
>  ELSE NULL  END AS ST_FLAG
> FROM  `PRDDB`.tuning_dd_key ;




>From the above query which part is the most time consuming?

For example is the LAG function the most consuming section that takers the
lion's hare of the query?

Just execute the code and comment out LAG(t_ev_st_td) .....  first

I suspect

CASE WHEN LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key
 ORDER BY  t_ev_st_dt ) IS NULL  OR a_z_key <> LAG( a_z_key , 1 , -999 )
 OVER ( PARTITION BY

is the other possible candidate as well with that OR than can cause the
issue

For example you can do the following to measure the timing

select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS
StartTime;
SELECT COUNT(1) FROM PRDDB`.tuning_dd_key
WHERE (LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER
BY  t_ev_st_dt ) IS NULL)
OR
a_z_key <> LAG( a_z_key , 1 , -999 )  OVER ( PARTITION BY  m_d_key ,
sb_gu_key  ORDER BY  t_ev_st_dt )
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Any and all responsibility for any loss, damage or
destruction of data or any other property which may arise from relying on
this email's technical content is explicitly disclaimed. The author will in
no case be liable for any monetary damages arising from such loss, damage
or destruction.



On 24 June 2016 at 22:34, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:

> Hi Vijay,
>
> Please help me on this....let me know you need other info.
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Thu, Jun 23, 2016 at 12:41 PM, @Sanjiv Singh <sanjiv.is...@gmail.com>
> wrote:
>
>> Hi Gopal,
>>
>> I am using Tez as execution engine.
>>
>> DAG :
>>
>> +--------------------------------------------------------+--+
>> |
>>                                   Explain
>>                                                                         |
>> +---------------------------------------------------------+--+
>> | Plan not optimized by CBO.
>>         |
>> |
>>                            |
>> | Vertex dependency in root stage
>>                             |
>> | Reducer 2 <- Map 1 (SIMPLE_EDGE)
>>                                                           |
>> |
>>                                  |
>> | Stage-0
>>                               |
>> |    Fetch Operator
>>                                                               |
>> |       limit:-1
>>                                   |
>> |       Stage-1
>>                                 |
>> |          Reducer 2
>>                                           |
>> |          File Output Operator [FS_55596]
>>                                                         |
>> |             compressed:false
>>                                                              |
>> |             Statistics:Num rows: 6357592675 Data size: 54076899328
>> Basic stats: COMPLETE Column stats: NONE          |
>> |
>> table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input
>> format:":"org.apache.hadoop.mapred.TextInputFormat","output
>> format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}  |
>> |             Select Operator [SEL_55594]
>>                 |
>> |
>>  
>> outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"]
>>                                    |
>> |                Statistics:Num rows: 6357592675 Data size: 54076899328
>> Basic stats: COMPLETE Column stats: NONE                       |
>> |                PTF Operator [PTF_55593]
>>                                    |
>> |                   Function definitions:[{"Input
>> definition":{"type:":"WINDOWING"}},{"partition by:":"_col0,
>> _col1","name:":"windowingtablefunction","order by:":"_col2"}]      |
>> |                   Statistics:Num rows: 6357592675 Data size:
>> 54076899328 Basic stats: COMPLETE Column stats: NONE           |
>> |                   Select Operator [SEL_55592]
>>                                           |
>> |                   |
>>  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6"]
>>                                                                   |
>> |                   |  Statistics:Num rows: 6357592675 Data size:
>> 54076899328 Basic stats: COMPLETE Column stats: NONE                    |
>> |                   |<-Map 1 [SIMPLE_EDGE] vectorized
>>                                  |
>> |                      Reduce Output Operator [RS_55597]
>>                                |
>> |                         key expressions:m_d_key (type: smallint),
>> sb_gu_key (type: bigint), t_ev_st_dt (type: date)         |
>> |                         Map-reduce partition columns:m_d_key (type:
>> smallint), sb_gu_key (type: bigint)            |
>> |                         sort order:+++
>>        |
>> |                         Statistics:Num rows: 6357592675 Data size:
>> 54076899328 Basic stats: COMPLETE Column stats: NONE
>>  |
>> |                         value expressions:ad_zn_key (type: int), c_dt
>> (type: date), e_p_dt (type: date), sq_nbr (type: int)           |
>> |                         TableScan [TS_55590]
>>                                                             |
>> |                            ACID table:true
>>                                            |
>> |                            alias:tuning_dd_key
>>                         |
>> |                            Statistics:Num rows: 6357592675 Data size:
>> 54076899328 Basic stats: COMPLETE Column stats: NONE                      |
>> |
>>
>>                 |
>>
>> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Thu, Jun 23, 2016 at 2:45 AM, Gopal Vijayaraghavan <gop...@apache.org>
>> wrote:
>>
>>>
>>> > Long running query :
>>>
>>> Are you running this on MapReduce or Tez?
>>>
>>> Please post the output of explain - if you are seeing > 1 shuffle edge in
>>> your query while having only one window for OVER(), that might be the
>>> reason.
>>>
>>> OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt)
>>>
>>>
>>> The multiple PTF operators should have been collapsed by the reduce
>>> sink-deduplication.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>

Re: Optimize Hive Query

Reply via email to