The indexes are based on the HDFS block size, which is usually around 128 MB. This means that to hit a single row you must always load the full block. In traditional databases the block size is much smaller, which makes point lookups correspondingly faster. If the optimizer does not pick up the index, you can query the index directly (it is just a table!). Keep in mind that you should also use an adequate storage format for the index itself, such as ORC or Parquet.
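To illustrate querying the index directly: a COMPACT index is materialized as an ordinary table that maps key values to file names and offsets. This is a minimal sketch; the index table name below follows Hive's usual `<db>__<table>_<indexname>__` naming convention and is an assumption based on the index in the quoted mail, so check SHOW TABLES for the actual name in your metastore.

```sql
-- Assumed name per Hive's index-table naming convention; verify with SHOW TABLES.
-- The compact index table stores the key plus _bucketname (file) and _offsets.
SELECT `_bucketname`, `_offsets`
FROM my_schema_name.my_schema_name__doc_t_doc_id_idx__
WHERE id = '3723445235879';
```

The rows returned tell you which files and block offsets contain the key, which you can use to narrow a manual scan even when the optimizer ignores the index.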
You should not use traditional indexes; instead use Hive on Tez and the ORC format with storage indexes and bloom filters (this requires at least Hive 1.2). It is of key importance that you insert the data sorted on the columns that you use in the WHERE clause. Compress the table with Snappy. Additionally, partitions make sense. Finally, please use the right data types: storage indexes work best with ints and the like; for text fields you can try bloom filters. That said, in other relational databases such as Oracle Exadata the use of traditional indexes is likewise discouraged for warehouse scenarios, and storage indexes and columnar formats with compression bring the most performance.

> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>
> Hello!
>
> I use Hive 1.1.0-cdh5.5.0 and am trying to use index support.
>
> My index creation:
> CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT' WITH
> DEFERRED REBUILD;
> ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;
>
> Then I set configs:
> set hive.optimize.autoindex=true;
> set hive.optimize.index.filter=true;
> set hive.optimize.index.filter.compact.minsize=0;
> set hive.index.compact.query.max.size=-1;
> set hive.index.compact.query.max.entries=-1;
>
> And my query is:
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>
> Sometimes I see a performance improvement, but in most cases not.
>
> In the cases where I do see an improvement:
> 1. my query
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
> gives me a NullPointerException (in the logs I see that Hive doesn't find my
> index table)
> 2. then I write:
> USE my_schema_name;
> select count(*) from doc_t WHERE id = '3723445235879';
> and get the result with the improvement
> (172 sec)
>
> In the cases where I don't see an improvement, I can run either
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
> without an exception, or
> USE my_schema_name;
> select count(*) from doc_t WHERE id = '3723445235879';
> and get the result
> (1153 sec)
>
> My table is about 6 billion rows.
> I tried various combinations of the index configs, including only these two:
> set hive.optimize.index.filter=true;
> set hive.optimize.index.filter.compact.minsize=0;
> My hadoop version is 2.6.0-cdh5.5.0
>
> What am I doing wrong?
>
> Thank you.
>
> --
> Best regards,
> Vadim Dedkov.
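The ORC advice above (storage indexes, bloom filters, Snappy, sorted insert) can be sketched roughly as follows. The table and filter column are taken from the quoted example; everything else (the new table name, the column list) is a hypothetical placeholder, not the poster's actual schema:

```sql
-- Sketch under assumptions: doc_t_orc and its column list are hypothetical.
CREATE TABLE my_schema_name.doc_t_orc (
  id STRING
  -- , further columns from doc_t ...
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',          -- Snappy compression
  'orc.create.index' = 'true',        -- min/max storage indexes per stripe
  'orc.bloom.filter.columns' = 'id'   -- bloom filter on the WHERE-clause column
);

-- Insert sorted on the filter column so the per-stripe min/max
-- storage indexes can actually prune stripes on id predicates.
INSERT OVERWRITE TABLE my_schema_name.doc_t_orc
SELECT * FROM my_schema_name.doc_t
SORT BY id;
```

With `hive.optimize.index.filter=true`, the predicate is pushed down into the ORC reader, so a point query like `WHERE id = '3723445235879'` skips every stripe whose min/max range or bloom filter excludes the value.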
