Re: Hive indexes without improvement of performance

Vadim Dedkov Thu, 16 Jun 2016 14:50:37 -0700

>>If the optimizer does not pick up the index then you can query the index
directly
Could you explain me, how I can do this for query like


*select * from my_schema_name.doc_t WHERE id = '3723445235879';*

?
Thank you
17 июня 2016 г. 0:03 пользователь "Jörn Franke" <[email protected]>
написал:

> The indexes are based on HDFS blocksize, which is usually around 128 mb.
> This means for hitting a single row you must always load the full block. In
> traditional databases this blocksize it is much faster. If the optimizer
> does not pick up the index then you can query the index directly (it is
> just a table!). Keep in mind that you should use for the index also an
> adequate storage format, such as Orc or parquet.
>
> You should not use the traditional indexes, but use Hive+Tez and the Orc
> format with storage indexes and bloom filters (i.e. Min Hive 1.2). It is of
> key importance that you insert the data sorted on the columns that you use
> in the where clause. You should compress the table with snappy.
> Additionally partitions make sense. Finally please use the right data types
> . Storage indexes work best with ints etc. for text fields you can try
> bloom filters.
>
> That being said, also in other relational databases such as Oracle
> Exadata, the use of traditional indexes is discouraged for warehouse
> scenarios, but storage indexes and columnar formats including compression
> will bring the most performance.
>
> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>
> Hello!
>
> I use Hive 1.1.0-cdh5.5.0 and try to use indexes support.
>
> My index creation:
> *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT'
> WITH DEFERRED REBUILD;*
> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>
> Then I set configs:
> *set hive.optimize.autoindex=true;*
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> *set hive.index.compact.query.max.size=-1;*
> *set hive.index.compact.query.max.entries=-1; *
>
> And my query is:
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>
> Sometimes I have improvement of performance, but most of cases - not.
>
> In cases when I have improvement:
> 1. my query is
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> give me NullPointerException (in logs I see that Hive doesn't find my
> index table)
> 2. then I write:
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result with improvement
> (172 sec)
>
> In case when I don't have improvement, I can use either
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> without exception, either
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result
> (1153 sec)
>
> My table is about 6 billion rows.
> I tried various combinations on index configs, including only these two:
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> My hadoop version is 2.6.0-cdh5.5.0
>
> What I do wrong?
>
> Thank you.
>
> --
> _______________             _______________
> Best regards,                    С уважением
> Vadim Dedkov.                  Вадим Дедков.
>
>

Re: Hive indexes without improvement of performance

Reply via email to