*without any improvement of performance

On 17 June 2016 at 1:00, "Vadim Dedkov" <[email protected]> wrote:
> Ok, thank you. I tried Hive with Tez for my index problem without any performance
>
> On 17 June 2016 at 0:22, "Mich Talebzadeh" <[email protected]> wrote:
>
>> Well, I guess I have to agree to differ on this with Jorn as before.
>>
>> Vadim,
>>
>> Please go ahead and try what Jorn suggests. Report back if you see any improvement.
>>
>> A couple of points, if I may:
>>
>> Using Hive on Tez is not going to improve the optimiser's performance. That is just the execution engine, and BTW I would rather use Hive on Spark. Both Tez and Spark will be a better fit than the usual map-reduce engine.
>>
>> Actually, my suggestion would be to use Hive as the storage layer only and use Spark as the query tool. In that case you don't need to worry about indexes etc. in Hive. Spark, with DAG and in-memory computing, will do a much better job.
>>
>> So:
>>
>> 1. Use Hive with its metadata to store data on HDFS.
>> 2. Use Spark SQL to query that data. Orders of magnitude faster.
>>
>> However, I am all for you trying what Jorn suggested.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
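A minimal sketch of Mich's two steps (not part of the original thread), assuming the existing table my_schema_name.doc_t is already registered in the Hive metastore and that Spark is built with Hive support so it can read that metastore:

    -- Step 1 is already in place: Hive keeps the metadata, the data sits on HDFS.
    -- Step 2: issue the same query from the spark-sql shell (or a HiveContext)
    -- instead of the hive CLI; Spark plans and executes the scan itself, and no
    -- Hive index is involved.
    USE my_schema_name;
    SELECT COUNT(*) FROM doc_t WHERE id = '3723445235879';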
>> On 16 June 2016 at 22:02, Jörn Franke <[email protected]> wrote:
>>
>>> The indexes are based on the HDFS block size, which is usually around 128 MB. This means that to hit a single row you must always load the full block; in traditional databases the block size is much smaller, which is why index lookups there are much faster. If the optimizer does not pick up the index, you can query the index directly (it is just a table!). Keep in mind that you should also use an adequate storage format for the index, such as ORC or Parquet.
>>>
>>> You should not use the traditional indexes, but use Hive+Tez and the ORC format with storage indexes and bloom filters (i.e. at least Hive 1.2). It is of key importance that you insert the data sorted on the columns that you use in the WHERE clause. You should compress the table with Snappy. Additionally, partitions make sense. Finally, please use the right data types: storage indexes work best with ints etc.; for text fields you can try bloom filters.
>>>
>>> That being said, also in other relational databases such as Oracle Exadata the use of traditional indexes is discouraged for warehouse scenarios; storage indexes and columnar formats, including compression, bring the most performance.
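A minimal sketch of the layout Jörn describes (not part of the original thread), assuming Hive 1.2 or later on Tez (note that ORC bloom filters postdate Vadim's Hive 1.1.0-cdh5.5.0 mentioned below). The doc_t_orc name is invented; the source table and the id column are taken from Vadim's message:

    set hive.execution.engine=tez;

    -- Rebuild the table as ORC: Snappy compression, ORC min/max storage indexes,
    -- and a bloom filter on the column used in the WHERE clause. Writing the data
    -- sorted on id keeps each stripe's id range narrow, which is what makes the
    -- storage index selective.
    CREATE TABLE my_schema_name.doc_t_orc
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'             = 'SNAPPY',
      'orc.create.index'         = 'true',
      'orc.bloom.filter.columns' = 'id'
    )
    AS
    SELECT * FROM my_schema_name.doc_t
    SORT BY id;

    -- Predicate push-down must be on so the stripe indexes are actually used:
    set hive.optimize.index.filter=true;
    SELECT COUNT(*) FROM my_schema_name.doc_t_orc WHERE id = '3723445235879';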
>>> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>>>
>>> Hello!
>>>
>>> I use Hive 1.1.0-cdh5.5.0 and am trying to use index support.
>>>
>>> My index creation:
>>> *CREATE INDEX doc_id_idx ON TABLE my_schema_name.doc_t (id) AS 'COMPACT' WITH DEFERRED REBUILD;*
>>> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>>>
>>> Then I set these configs:
>>> *set hive.optimize.autoindex=true;*
>>> *set hive.optimize.index.filter=true;*
>>> *set hive.optimize.index.filter.compact.minsize=0;*
>>> *set hive.index.compact.query.max.size=-1;*
>>> *set hive.index.compact.query.max.entries=-1;*
>>>
>>> And my query is:
>>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>>>
>>> Sometimes I get a performance improvement, but in most cases I do not.
>>>
>>> In the cases where I do get an improvement:
>>> 1. My query
>>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>>> gives me a NullPointerException (in the logs I see that Hive does not find my index table).
>>> 2. Then I write:
>>> *USE my_schema_name;*
>>> *select count(*) from doc_t WHERE id = '3723445235879';*
>>> and get the result with the improvement (172 sec).
>>>
>>> In the cases without improvement, I can run either
>>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>>> without an exception, or
>>> *USE my_schema_name;*
>>> *select count(*) from doc_t WHERE id = '3723445235879';*
>>> and get the result (1153 sec).
>>>
>>> My table has about 6 billion rows.
>>> I have tried various combinations of the index configs, including only these two:
>>> *set hive.optimize.index.filter=true;*
>>> *set hive.optimize.index.filter.compact.minsize=0;*
>>> My Hadoop version is 2.6.0-cdh5.5.0.
>>>
>>> What am I doing wrong?
>>>
>>> Thank you.
>>>
>>> --
>>> Best regards,
>>> Vadim Dedkov.
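A minimal sketch of Jörn's "query the index directly" tip applied to Vadim's index (not part of the original thread). Hive generates the index table's name automatically; the name used below is an assumption, so check it first:

    USE my_schema_name;

    -- The rebuilt COMPACT index is just another table; list its generated name:
    SHOW FORMATTED INDEX ON doc_t;

    -- Query that index table directly on the same predicate. It is far smaller
    -- than doc_t, and its _bucketname/_offsets columns point at the HDFS files
    -- and block offsets that contain the matching rows.
    -- The table name below is an assumed generated name; substitute the one
    -- reported by the SHOW statement above.
    SELECT `_bucketname`, `_offsets`
    FROM my_schema_name__doc_t_doc_id_idx__
    WHERE id = '3723445235879';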