The indexes are based on the HDFS block size, which is usually around 128 MB. This means that to hit a single row you must always load the full block. In traditional databases the block size is much smaller, which makes point lookups correspondingly faster. If the optimizer does not pick up the index, you can query the index directly (it is just a table!). Keep in mind that you should also use an adequate storage format for the index itself, such as ORC or Parquet.
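To illustrate querying the index directly: a COMPACT index is materialized as an ordinary table that maps key values to file names and offsets. This is a minimal sketch; the index table name below follows Hive's usual `<db>__<table>_<indexname>__` naming convention and is an assumption based on the index in the quoted mail, so check SHOW TABLES for the actual name in your metastore.

```sql
-- Assumed name per Hive's index-table naming convention; verify with SHOW TABLES.
-- The compact index table stores the key plus _bucketname (file) and _offsets.
SELECT `_bucketname`, `_offsets`
FROM my_schema_name.my_schema_name__doc_t_doc_id_idx__
WHERE id = '3723445235879';
```

The rows returned tell you which files and block offsets contain the key, which you can use to narrow a manual scan even when the optimizer ignores the index.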
You should not use traditional indexes; instead use Hive on Tez and the ORC format with storage indexes and bloom filters (this requires at least Hive 1.2). It is of key importance that you insert the data sorted on the columns that you use in the WHERE clause. Compress the table with Snappy. Additionally, partitions make sense. Finally, please use the right data types: storage indexes work best with ints and the like; for text fields you can try bloom filters. That said, in other relational databases such as Oracle Exadata the use of traditional indexes is likewise discouraged for warehouse scenarios, and storage indexes and columnar formats with compression bring the most performance.

> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>
> Hello!
>
> I use Hive 1.1.0-cdh5.5.0 and am trying to use index support.
>
> My index creation:
> CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT' WITH
> DEFERRED REBUILD;
> ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;
>
> Then I set configs:
> set hive.optimize.autoindex=true;
> set hive.optimize.index.filter=true;
> set hive.optimize.index.filter.compact.minsize=0;
> set hive.index.compact.query.max.size=-1;
> set hive.index.compact.query.max.entries=-1;
>
> And my query is:
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>
> Sometimes I see a performance improvement, but in most cases not.
>
> In the cases where I do see an improvement:
> 1. my query
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
> gives me a NullPointerException (in the logs I see that Hive doesn't find my
> index table)
> 2. then I write:
> USE my_schema_name;
> select count(*) from doc_t WHERE id = '3723445235879';
> and get the result with the improvement
> (172 sec)
>
> In the cases where I don't see an improvement, I can run either
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
> without an exception, or
> USE my_schema_name;
> select count(*) from doc_t WHERE id = '3723445235879';
> and get the result
> (1153 sec)
>
> My table is about 6 billion rows.
> I tried various combinations of the index configs, including only these two:
> set hive.optimize.index.filter=true;
> set hive.optimize.index.filter.compact.minsize=0;
> My hadoop version is 2.6.0-cdh5.5.0
>
> What am I doing wrong?
>
> Thank you.
>
> --
> Best regards,
> Vadim Dedkov.
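The ORC advice above (storage indexes, bloom filters, Snappy, sorted insert) can be sketched roughly as follows. The table and filter column are taken from the quoted example; everything else (the new table name, the column list) is a hypothetical placeholder, not the poster's actual schema:

```sql
-- Sketch under assumptions: doc_t_orc and its column list are hypothetical.
CREATE TABLE my_schema_name.doc_t_orc (
  id STRING
  -- , further columns from doc_t ...
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',          -- Snappy compression
  'orc.create.index' = 'true',        -- min/max storage indexes per stripe
  'orc.bloom.filter.columns' = 'id'   -- bloom filter on the WHERE-clause column
);

-- Insert sorted on the filter column so the per-stripe min/max
-- storage indexes can actually prune stripes on id predicates.
INSERT OVERWRITE TABLE my_schema_name.doc_t_orc
SELECT * FROM my_schema_name.doc_t
SORT BY id;
```

With `hive.optimize.index.filter=true`, the predicate is pushed down into the ORC reader, so a point query like `WHERE id = '3723445235879'` skips every stripe whose min/max range or bloom filter excludes the value.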
