Hi,
Your point below: "The 'traditional' indexes can still make sense for data not in ORC or Parquet format."

Kindly consider the following. A traditional index in an RDBMS is normally a B-tree index holding a value for that column plus a pointer (row ID) to the row in the data block that keeps the data. In the RDBMS I create a unique index on column OBJECT_ID on table 't' below and run a simple query that can be covered by the index without touching the base table:

1> select count(1) from t where OBJECT_ID < 100
2> go

QUERY PLAN FOR STATEMENT 1 (at line 1).
STEP 1
    The type of query is EXECUTE.
    Executing a newly cached statement (SSQL_ID = 312036659).
Total estimated I/O cost for statement 1 (at line 1): 0.

QUERY PLAN FOR STATEMENT 1 (at line 0).
STEP 1
    The type of query is DECLARE.
Total estimated I/O cost for statement 1 (at line 0): 0.

QUERY PLAN FOR STATEMENT 2 (at line 1).
Optimized using Parallel Mode
STEP 1
    The type of query is SELECT.
    3 operator(s) under root

    |ROOT:EMIT Operator (VA = 3)
    |
    |   |SCALAR AGGREGATE Operator (VA = 2)
    |   |  Evaluate Ungrouped COUNT AGGREGATE.
    |   |
    |   |   |RESTRICT Operator (VA = 1)(3)(0)(0)(0)(0)
    |   |   |
    |   |   |   |SCAN Operator (VA = 0)
    |   |   |   |  FROM TABLE
    |   |   |   |  t
    |   |   |   |  Using Clustered Index.
    |   |   |   |  Index : t_ui
    |   |   |   |  Forward Scan.
    |   |   |   |  Positioning by key.
    |   |   |   |  Index contains all needed columns. Base table will not be read.
    |   |   |   |  Keys are:
    |   |   |   |    OBJECT_ID ASC
    |   |   |   |  Using I/O Size 64 Kbytes for index leaf pages.
    |   |   |   |  With LRU Buffer Replacement Strategy for index leaf pages.

Total estimated I/O cost for statement 2 (at line 1): 322792.

OK, so no base table is touched.

Let us do a similar thing by creating an index on OBJECT_ID in table 't', imported from the said table and created in Hive:

create index t_ui on table t (object_id) as 'COMPACT' WITH DEFERRED REBUILD;
alter index t_ui on t rebuild;
analyze table t compute statistics;

I am going to run the same query in Hive. However, I only see a table scan below and no mention of that index.
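One detail that may be relevant (a hedged sketch, not a definitive fix): by default Hive's optimizer does not consult a COMPACT index at all; query rewriting against indexes has to be switched on per session. The property names below come from Hive's index-related configuration and may vary by release:

```sql
-- Sketch: enable automatic use of indexes for filter predicates.
-- hive.optimize.index.filter defaults to false, so without this the
-- compact index above is never considered by the planner.
SET hive.optimize.index.filter=true;

-- The default minimum input size before an index is considered is very
-- large (~5 GB); lowering it forces consideration on a small table.
SET hive.optimize.index.filter.compact.minsize=0;

-- Re-run the EXPLAIN; with these set, an index-based stage may appear
-- instead of the plain TableScan.
EXPLAIN SELECT COUNT(1) FROM t WHERE object_id < 100;
```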
Maybe I am missing something here?

0: jdbc:hive2://rhes564:10010/default> explain select count(1) from t where OBJECT_ID < 100;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160105203204_8d987e9a-415a-476a-8bad-b9a5010e36bf:54
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t
                  Statistics: Num rows: 2074897 Data size: 64438212 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (object_id < 100) (type: boolean)
                    Statistics: Num rows: 691632 Data size: 21479393 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 691632 Data size: 21479393 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: 05 January 2016 19:59
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?

Btw, this is not Hive specific; the same applies to other relational database systems, such as Oracle Exadata.

On 05 Jan 2016, at 20:57, Jörn Franke <jornfra...@gmail.com> wrote:

You can still use execution engine mr for maintaining the index. Indeed, with the ORC or Parquet format there are min/max indexes and bloom filters, but you need to sort your data appropriately to benefit from the performance. Alternatively, you can create redundant tables sorted in different orders. The "traditional" indexes can still make sense for data not in ORC or Parquet format. Keep in mind that for warehouse scenarios there are many other optimization methods in Hive.
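The ORC suggestion above can be sketched as follows (a hedged example, assuming Hive 1.2+ where ORC bloom filters are available; the table and column names reuse the 't'/object_id example from earlier in the thread, and t_orc is a hypothetical name):

```sql
-- Store the data as ORC, sorted on the filter column, with a bloom
-- filter on it. "orc.create.index" and "orc.bloom.filter.columns" are
-- standard ORC table properties.
CREATE TABLE t_orc
STORED AS ORC
TBLPROPERTIES ("orc.create.index" = "true",
               "orc.bloom.filter.columns" = "object_id")
AS SELECT * FROM t SORT BY object_id;   -- SORT BY orders rows per reducer

-- The min/max stripe statistics and the bloom filter let the ORC reader
-- skip stripes and row groups where object_id < 100 cannot match, giving
-- an index-like effect without a separate index table.
SELECT COUNT(1) FROM t_orc WHERE object_id < 100;
```

Note that the skipping only pays off because the data is sorted: if matching object_id values were scattered across all stripes, every stripe's min/max range would cover the predicate and nothing could be skipped.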