>>If the optimizer does not pick up the index then you can query the index directly Could you explain me, how I can do this for query like
*select * from my_schema_name.doc_t WHERE id = '3723445235879';* ? Thank you 17 июня 2016 г. 0:03 пользователь "Jörn Franke" <[email protected]> написал: > The indexes are based on HDFS blocksize, which is usually around 128 mb. > This means for hitting a single row you must always load the full block. In > traditional databases this blocksize it is much faster. If the optimizer > does not pick up the index then you can query the index directly (it is > just a table!). Keep in mind that you should use for the index also an > adequate storage format, such as Orc or parquet. > > You should not use the traditional indexes, but use Hive+Tez and the Orc > format with storage indexes and bloom filters (i.e. Min Hive 1.2). It is of > key importance that you insert the data sorted on the columns that you use > in the where clause. You should compress the table with snappy. > Additionally partitions make sense. Finally please use the right data types > . Storage indexes work best with ints etc. for text fields you can try > bloom filters. > > That being said, also in other relational databases such as Oracle > Exadata, the use of traditional indexes is discouraged for warehouse > scenarios, but storage indexes and columnar formats including compression > will bring the most performance. > > On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote: > > Hello! > > I use Hive 1.1.0-cdh5.5.0 and try to use indexes support. > > My index creation: > *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT' > WITH DEFERRED REBUILD;* > *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;* > > Then I set configs: > *set hive.optimize.autoindex=true;* > *set hive.optimize.index.filter=true;* > *set hive.optimize.index.filter.compact.minsize=0;* > *set hive.index.compact.query.max.size=-1;* > *set hive.index.compact.query.max.entries=-1; * > > And my query is: > *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';* > > Sometimes I have improvement of performance, but most of cases - not. > > In cases when I have improvement: > 1. my query is > *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';* > give me NullPointerException (in logs I see that Hive doesn't find my > index table) > 2. then I write: > *USE my_schema_name;* > *select count(*) from doc_t WHERE id = '3723445235879';* > and have result with improvement > (172 sec) > > In case when I don't have improvement, I can use either > *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';* > without exception, either > *USE my_schema_name;* > *select count(*) from doc_t WHERE id = '3723445235879';* > and have result > (1153 sec) > > My table is about 6 billion rows. > I tried various combinations on index configs, including only these two: > *set hive.optimize.index.filter=true;* > *set hive.optimize.index.filter.compact.minsize=0;* > My hadoop version is 2.6.0-cdh5.5.0 > > What I do wrong? > > Thank you. > > -- > _______________ _______________ > Best regards, С уважением > Vadim Dedkov. Вадим Дедков. > >
