Re: Optimize Hive Query

Mich Talebzadeh Thu, 23 Jun 2016 13:48:58 -0700

Funny enough it is pretty close to similar ORC transactional tables I have.
Standard with 256 buckets with two columns as below


number of distinct value in column m_d_key : 29
> number of distinct value in column sb_gu_key : 15434343


You have also vectorised data taking 1024 rows at once.

Still the optimizer does not tell me much. Also I don't use TEZ. I use
Spark as the execution engine.  From my experience (and I am sure there
will be plenty who will disagree with me :)), the optimiser does not make
much difference, it is the execution engine than delivers the performance.

The other alternative is that when you populate the table insert the data
sorted by m_d_key, sb_gu_key, t_ev_st_dt to ensure that that the optimizer
will be better off.

Also may help if you add t_ev_st_dt   as the third column of the bucket as
the LAG() function is using it in ORDER BY


CLUSTERED BY (m_d_key, sb_gu_key, t_ev_st_dt)  INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  "transactional"="true",
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="m_d_key, sb_gu_key, t_ev_st_dt",
  "orc.bloom.filter.fpp"="0.05",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="10000"
)

Others may have better ideas.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 23 June 2016 at 20:30, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:

> Hi Mich ,
>
> Please find below output of command.
>
> desc formatted tuning_dd_key ;
>
>
> +-------------------------------+-----------------------------------------------------------------------------------------------+-----------------------+--+
> |           col_name            |
>   data_type                                           |        comment
>    |
>
> +-------------------------------+-----------------------------------------------------------------------------------------------+-----------------------+--+
> | # col_name                    | data_type
>                                                       | comment
>   |
> |                               | NULL
>                                                      | NULL
>  |
> | m_d_key                       | smallint
>                                                      |
>   |
> | sb_gu_key                     | bigint
>                                                      |
>   |
> | t_ev_st_dt                    | date
>                                                      |
>   |
> | ad_zn_key                     | int
>                                                       |
>   |
> | c_dt                          | date
>                                                      |
>   |
> | e_p_dt                        | date
>                                                      |
>   |
> | sq_nbr                        | int
>                                                       |
>   |
> |                               | NULL
>                                                      | NULL
>  |
> | # Detailed Table Information  | NULL
>                                                      | NULL
>  |
> | Database:                     | PRDDB
>                                                       | NULL
>    |
> | CreateTime:                   | Thu Jun 23 11:03:53 EDT 2016
>                                                      | NULL
>  |
> | LastAccessTime:               | UNKNOWN
>                                                       | NULL
>    |
> | Protect Mode:                 | None
>                                                      | NULL
>  |
> | Retention:                    | 0
>                                                       | NULL
>    |
> | Table Type:                   | MANAGED_TABLE
>                                                       | NULL
>    |
> | Table Parameters:             | NULL
>                                                      | NULL
>  |
> |                               | COLUMN_STATS_ACCURATE
>                                                       | true
>    |
> |                               | numFiles
>                                                      | 256
>   |
> |                               | numRows
>                                                       | 6357592675
>    |
> |                               | rawDataSize
>                                                       | 0
>   |
> |                               | totalSize
>                                                       | 54076898961
>   |
> |                               | transactional
>                                                       | true
>    |
> |                               | transient_lastDdlTime
>                                                       | 1466694970
>    |
> |                               | NULL
>                                                      | NULL
>  |
> | # Storage Information         | NULL
>                                                      | NULL
>  |
> | SerDe Library:                |
> org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                   | NULL                  |
> | InputFormat:                  |
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                   | NULL                  |
> | OutputFormat:                 |
> org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                    | NULL                  |
> | Compressed:                   | No
>                                                      | NULL
>  |
> | Num Buckets:                  | 256
>                                                       | NULL
>    |
> | Bucket Columns:               | [sbsc_guid_key, mas_div_key]
>                                                      | NULL
>  |
> | Sort Columns:                 | []
>                                                      | NULL
>  |
> | Storage Desc Params:          | NULL
>                                                      | NULL
>  |
> |                               | serialization.format
>                                                      | 1
>   |
>
> +-------------------------------+-----------------------------------------------------------------------------------------------+-----------------------+--+
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Thu, Jun 23, 2016 at 12:47 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Do you also have the output from
>>
>> desc formatted tuning_dd_key
>>
>>  and send the output please?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 23 June 2016 at 17:41, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>
>>> Hi Gopal,
>>>
>>> I am using Tez as execution engine.
>>>
>>> DAG :
>>>
>>> +--------------------------------------------------------+--+
>>> |
>>>                                     Explain
>>>                                                                           |
>>> +---------------------------------------------------------+--+
>>> | Plan not optimized by CBO.
>>>         |
>>> |
>>>                              |
>>> | Vertex dependency in root stage
>>>                               |
>>> | Reducer 2 <- Map 1 (SIMPLE_EDGE)
>>>                                                           |
>>> |
>>>                                    |
>>> | Stage-0
>>>                                 |
>>> |    Fetch Operator
>>>                                                                 |
>>> |       limit:-1
>>>                                   |
>>> |       Stage-1
>>>                                   |
>>> |          Reducer 2
>>>                                           |
>>> |          File Output Operator [FS_55596]
>>>                                                         |
>>> |             compressed:false
>>>                                                              |
>>> |             Statistics:Num rows: 6357592675 Data size: 54076899328
>>> Basic stats: COMPLETE Column stats: NONE          |
>>> |
>>> table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input
>>> format:":"org.apache.hadoop.mapred.TextInputFormat","output
>>> format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}  |
>>> |             Select Operator [SEL_55594]
>>>                   |
>>> |
>>>  
>>> outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"]
>>>                                    |
>>> |                Statistics:Num rows: 6357592675 Data size: 54076899328
>>> Basic stats: COMPLETE Column stats: NONE                       |
>>> |                PTF Operator [PTF_55593]
>>>                                      |
>>> |                   Function definitions:[{"Input
>>> definition":{"type:":"WINDOWING"}},{"partition by:":"_col0,
>>> _col1","name:":"windowingtablefunction","order by:":"_col2"}]      |
>>> |                   Statistics:Num rows: 6357592675 Data size:
>>> 54076899328 Basic stats: COMPLETE Column stats: NONE           |
>>> |                   Select Operator [SEL_55592]
>>>                                             |
>>> |                   |
>>>  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6"]
>>>                                                                   |
>>> |                   |  Statistics:Num rows: 6357592675 Data size:
>>> 54076899328 Basic stats: COMPLETE Column stats: NONE                    |
>>> |                   |<-Map 1 [SIMPLE_EDGE] vectorized
>>>                                    |
>>> |                      Reduce Output Operator [RS_55597]
>>>                                |
>>> |                         key expressions:m_d_key (type: smallint),
>>> sb_gu_key (type: bigint), t_ev_st_dt (type: date)         |
>>> |                         Map-reduce partition columns:m_d_key (type:
>>> smallint), sb_gu_key (type: bigint)            |
>>> |                         sort order:+++
>>>        |
>>> |                         Statistics:Num rows: 6357592675 Data size:
>>> 54076899328 Basic stats: COMPLETE Column stats: NONE
>>>  |
>>> |                         value expressions:ad_zn_key (type: int), c_dt
>>> (type: date), e_p_dt (type: date), sq_nbr (type: int)           |
>>> |                         TableScan [TS_55590]
>>>                                                             |
>>> |                            ACID table:true
>>>                                            |
>>> |                            alias:tuning_dd_key
>>>                         |
>>> |                            Statistics:Num rows: 6357592675 Data size:
>>> 54076899328 Basic stats: COMPLETE Column stats: NONE                      |
>>> |
>>>
>>>                   |
>>>
>>> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>> On Thu, Jun 23, 2016 at 2:45 AM, Gopal Vijayaraghavan <gop...@apache.org
>>> > wrote:
>>>
>>>>
>>>> > Long running query :
>>>>
>>>> Are you running this on MapReduce or Tez?
>>>>
>>>> Please post the output of explain - if you are seeing > 1 shuffle edge
>>>> in
>>>> your query while having only one window for OVER(), that might be the
>>>> reason.
>>>>
>>>> OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt)
>>>>
>>>>
>>>> The multiple PTF operators should have been collapsed by the reduce
>>>> sink-deduplication.
>>>>
>>>> Cheers,
>>>> Gopal
>>>>
>>>>
>>>>
>>>
>>
>

Re: Optimize Hive Query

Reply via email to