Re: Hive index + Tez engine = no performance gain?!

Gopal Vijayaraghavan Tue, 22 Aug 2017 00:05:33 -0700

TL;DR - A Materialized view is a much more useful construct than trying to get 
limited indexes to work.


That is pretty lively project which has been going on for a while with 
Druid+LLAP

https://issues.apache.org/jira/browse/HIVE-14486

> This seems out of the blue but my initial benchmarks have shown that there's 
> no performance gain when Hive index is used with Tez engine.

Indexes in Hive are broken once columnar formats came in, because there is no 
concept of directly locating a row in another table from an existing table - 
file+offsets doesn't work for a columnar format in any practical sense.

Once you have a fast format, these indexes are only good enough to answer 
queries directly from an index by maintaining a more compact copy of data, 
which is really not what an index is supposed to do.

> I think that index is a perfect solution for non-ORC file format since you 
> can selectively build an index table and leverage Tez to only look at those 
> blocks and/or files that we need to scan

Since LLAP is a distributed system, the concept of "looking up indexes and then 
looking at blocks" does not work the same way it would work for a single 
machine DB with common memory across all threads - the index file scans cannot 
go and specify which other blocks to read, because there is no single frame of 
reference for this.

We can only get meaningful speedups if the index are co-located with the blocks 
on the same machine and can reference column groups (10k rows in each chunk), 
which is exactly what ORC+LLAP manages to do with ORC's bloom filter indexes.

> we still want to have fast ad-hoc query via Hive LLAP / Tez

LLAP can do fast ad-hoc queries at around ~100 million rows/sec on a single 
node, which is pretty fast without needing a separate index + a lookup loop.

In a production cluster, with constant ETL, the global indexes were actually 
lowering throughput since they got stale every few minutes and rebuilding fell 
behind because it is lower priority operations through-out.

Here's an experiment for you to try

CREATE INDEX idx_COUNTRY 
ON TABLE customer(c_birth_country)
AS 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES('AGGREGATES'='count(c_birth_country)')
STORED AS ORC;

Which does build an index in ORC (why not?), can only be made in MR because the 
index handlers weren't implemented in Tez. 

Tez can still use those indexes.

0: jdbc:hive2://localhost:10007/tpcds_bin_par> set 
hive.optimize.index.groupby=true;

0: jdbc:hive2://localhost:10007/tpcds_bin_par> 
 Vertex dependency in root stage                                                
                                                             
 Reducer 2 <- Map 1 (SIMPLE_EDGE)                                               
                                                             
                                                                                
                                                             
 Stage-0                                                                        
                                                             
   Fetch Operator                                                               
                                                             
     limit:-1                                                                   
                                                             
     Stage-1                                                                    
                                                             
       Reducer 2 vectorized
       File Output Operator [FS_8]                                              
                                                             
         Group By Operator [GBY_7] (rows=1060 width=9536)                       
                                                             
           
Output:["_col0","_col1"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0       
                                                  
         <-Map 1 [SIMPLE_EDGE] 
           SHUFFLE [RS_3]                                                       
                                                             
             PartitionCols:_col0                                                
                                                             
             Group By Operator [GBY_2] (rows=2120 width=9536)                   
                                                             
               
Output:["_col0","_col1"],aggregations:["sum(_count_of_c_birth_country)"],keys:c_birth_country
                                 
               Select Operator [SEL_1] (rows=2120 width=9536)                   
                                                             
                 Output:["c_birth_country","_count_of_c_birth_country"]         
                                                             
                 TableScan [TS_0] (rows=2120 width=9536)                        
                                                             
                   
testing@testing__customer_idx_country__,testing__customer_idx_country__,Tbl:COMPLETE,Col:NONE,Output:["c_birth_country"]
  

If you look closely at the index at that point, it is clear that it is not a 
true index, but a materialized view.

And that as a materialized view, the reason this is running slower is due to a 
column which won't happen in a true materialized view

desc testing__customer_idx_country__;
+----------------------------+----------------+----------+--+
|          col_name          |   data_type    | comment  |
+----------------------------+----------------+----------+--+
| c_birth_country            | string         |          |
| _bucketname                | string         |          |
| _offsets                   | array<bigint>  |          |
| _count_of_c_birth_country  | bigint         |          |
+----------------------------+----------------+----------+--+

_offsets was ruining everything, from compression to vectorization.

Which brings me to my actual point - Materialized views are good, but external 
indexes aren't great at being materialized views. At least now that we already 
have internal indexes and in-memory caching for them.

Cheers,
Gopal

Re: Hive index + Tez engine = no performance gain?!

Reply via email to