Hi,
Your point below: "The 'traditional' indexes can still make sense for data not in ORC or Parquet format."

Kindly consider the following. A traditional index in an RDBMS is normally a B-tree index holding a value for that column plus a pointer (row ID) to the row in the data block that keeps the data. In the RDBMS I create a unique index on column OBJECT_ID on table 't' below and run a simple query that can be covered by the index without touching the base table:

1> select count(1) from t where OBJECT_ID < 100
2> go

QUERY PLAN FOR STATEMENT 1 (at line 1).
STEP 1
    The type of query is EXECUTE.
    Executing a newly cached statement (SSQL_ID = 312036659).
Total estimated I/O cost for statement 1 (at line 1): 0.

QUERY PLAN FOR STATEMENT 1 (at line 0).
STEP 1
    The type of query is DECLARE.
Total estimated I/O cost for statement 1 (at line 0): 0.

QUERY PLAN FOR STATEMENT 2 (at line 1).
Optimized using Parallel Mode
STEP 1
    The type of query is SELECT.
    3 operator(s) under root

    |ROOT:EMIT Operator (VA = 3)
    |
    |   |SCALAR AGGREGATE Operator (VA = 2)
    |   |  Evaluate Ungrouped COUNT AGGREGATE.
    |   |
    |   |   |RESTRICT Operator (VA = 1)(3)(0)(0)(0)(0)
    |   |   |
    |   |   |   |SCAN Operator (VA = 0)
    |   |   |   |  FROM TABLE
    |   |   |   |  t
    |   |   |   |  Using Clustered Index.
    |   |   |   |  Index : t_ui
    |   |   |   |  Forward Scan.
    |   |   |   |  Positioning by key.
    |   |   |   |  Index contains all needed columns. Base table will not be read.
    |   |   |   |  Keys are:
    |   |   |   |    OBJECT_ID ASC
    |   |   |   |  Using I/O Size 64 Kbytes for index leaf pages.
    |   |   |   |  With LRU Buffer Replacement Strategy for index leaf pages.

Total estimated I/O cost for statement 2 (at line 1): 322792.

OK, so no base table is touched.

Let us do a similar thing by creating an index on OBJECT_ID in table 't', imported from the said table and created in Hive:

create index t_ui on table t (object_id) as 'COMPACT' WITH DEFERRED REBUILD;
alter index t_ui on t rebuild;
analyze table t compute statistics;

I am going to run the same query in Hive. However, I only see a table scan below and no mention of that index.
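One detail that may be relevant (a hedged sketch, not a definitive fix): by default Hive's optimizer does not consult a COMPACT index at all; query rewriting against indexes has to be switched on per session. The property names below come from Hive's index-related configuration and may vary by release:

```sql
-- Sketch: enable automatic use of indexes for filter predicates.
-- hive.optimize.index.filter defaults to false, so without this the
-- compact index above is never considered by the planner.
SET hive.optimize.index.filter=true;

-- The default minimum input size before an index is considered is very
-- large (~5 GB); lowering it forces consideration on a small table.
SET hive.optimize.index.filter.compact.minsize=0;

-- Re-run the EXPLAIN; with these set, an index-based stage may appear
-- instead of the plain TableScan.
EXPLAIN SELECT COUNT(1) FROM t WHERE object_id < 100;
```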
Maybe I am missing something here?

0: jdbc:hive2://rhes564:10010/default> explain select count(1) from t where OBJECT_ID < 100;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160105203204_8d987e9a-415a-476a-8bad-b9a5010e36bf:54
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t
                  Statistics: Num rows: 2074897 Data size: 64438212 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (object_id < 100) (type: boolean)
                    Statistics: Num rows: 691632 Data size: 21479393 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 691632 Data size: 21479393 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: 05 January 2016 19:59
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?

Btw, this is not Hive specific; the same applies to other relational database systems, such as Oracle Exadata.

On 05 Jan 2016, at 20:57, Jörn Franke <jornfra...@gmail.com> wrote:

You can still use execution engine mr for maintaining the index. Indeed, with the ORC or Parquet format there are min/max indexes and bloom filters, but you need to sort your data appropriately to benefit from the performance. Alternatively, you can create redundant tables sorted in different orders. The "traditional" indexes can still make sense for data not in ORC or Parquet format. Keep in mind that for warehouse scenarios there are many other optimization methods in Hive.
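The ORC suggestion above can be sketched as follows (a hedged example, assuming Hive 1.2+ where ORC bloom filters are available; the table and column names reuse the 't'/object_id example from earlier in the thread, and t_orc is a hypothetical name):

```sql
-- Store the data as ORC, sorted on the filter column, with a bloom
-- filter on it. "orc.create.index" and "orc.bloom.filter.columns" are
-- standard ORC table properties.
CREATE TABLE t_orc
STORED AS ORC
TBLPROPERTIES ("orc.create.index" = "true",
               "orc.bloom.filter.columns" = "object_id")
AS SELECT * FROM t SORT BY object_id;   -- SORT BY orders rows per reducer

-- The min/max stripe statistics and the bloom filter let the ORC reader
-- skip stripes and row groups where object_id < 100 cannot match, giving
-- an index-like effect without a separate index table.
SELECT COUNT(1) FROM t_orc WHERE object_id < 100;
```

Note that the skipping only pays off because the data is sorted: if matching object_id values were scattered across all stripes, every stripe's min/max range would cover the predicate and nothing could be skipped.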