Hi,
I am very new to hive optimiser Here I have a table with 4 million rows imported from Oracle via sqoop/hive. In this table object_id column is unique. Oracle table has primary key constraint on object_id column which is basically a unique B-tree index. I do a very simple query to see how many unique values are for object_id in table. The answer is they are as many as number of rows. So query like below in Oracle SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM hddtester.tdash; Should return 1. Now Oracle optimiser only needs to read the index key and work it out WITHOUT touching the underlying table and it does that ---------------------------------------------------------- Plan hash value: 1988751498 ----------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 26 | 13952 (1)| 00:02:48 | | 1 | SORT AGGREGATE | | 1 | 26 | | | | 2 | VIEW | VW_DAG_0 | 4000K| 99M| 13952 (1)| 00:02:48 | | 3 | SORT GROUP BY NOSORT| | 4000K| 22M| 13952 (1)| 00:02:48 | | 4 | INDEX FULL SCAN | TDASH_PK | 4000K| 22M| 13952 (1)| 00:02:48 | ----------------------------------------------------------------------------------- Here it is shown as Operation Id = 4 “INDEX FULL SCAN”. Please note that the table itself is not touched as expected Now I have the same table “tdash” in Hive with a compact index on object_id. I have analysed stats for table with “analyze table tdash compute statistics”. Now I do explain as below hive> explain SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM tdash; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: tdash Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: object_id (type: double) outputColumnNames: object_id Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count(DISTINCT object_id), count(object_id) keys: object_id (type: double) mode: hash outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: double) sort order: + Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE value expressions: _col2 (type: bigint) Reduce Operator Tree: Group By Operator aggregations: count(DISTINCT KEY._col0:0._col0), count(VALUE._col1) mode: mergepartial outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: (_col0 / _col1) (type: double) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Time taken: 0.691 seconds, Fetched: 50 row(s) Trying to understand above does keys: object_id (type: double) refers to use of index here? I dropped that index and the same plan was produced! How Hive optimiser flag the index usage in the plan. Do I need to update column statists as well as table Many thanks Mich Talebzadeh http://talebzadehmich.wordpress.com Publications due shortly: Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility. From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: 26 March 2015 17:27 To: user@hive.apache.org Subject: Understanding Hive's execution plan Hi, Can anyone direct me to a good explanation on understanding Hive's execution plan? Thanks, Daniel