Ok, use EXPLAIN EXTENDED on your SQL query to see if the optimizer makes a good decision.
Help the optimizer by doing a stats update at column level:

ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS;

Use DESC FORMATTED <TABLE_NAME> to see the stats.

HTH

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 16 June 2016 at 23:01, Vadim Dedkov <[email protected]> wrote:

> *without any improvement of performance
>
> On 17 June 2016 at 1:00, Vadim Dedkov <[email protected]> wrote:
>
>> Ok, thank you. I tried Hive with Tez for my index problem, without any
>> improvement of performance.
>>
>> On 17 June 2016 at 0:22, Mich Talebzadeh <[email protected]> wrote:
>>
>>> Well, I guess I have to agree to differ on this with Jorn, as before.
>>>
>>> Vadim,
>>>
>>> Please go ahead and try what Jorn suggests. Report back if you see any
>>> improvement.
>>>
>>> A couple of points, if I may:
>>>
>>> Using Hive on Tez is not going to improve the optimizer's performance.
>>> That is just the execution engine, and BTW I would rather use Hive on
>>> Spark. Both Tez and Spark will be a better fit than the usual
>>> map-reduce engine.
>>>
>>> Actually, my suggestion would be to use Hive as the storage layer only
>>> and use Spark as the query tool. In that case you don't need to worry
>>> about indexes etc. in Hive. Spark, with its DAG and in-memory computing,
>>> will do a much better job.
>>>
>>> So:
>>>
>>>    1. Use Hive with its metadata to store data on HDFS.
>>>    2. Use Spark SQL to query that data. Orders of magnitude faster.
>>>
>>> However, I am all for you trying what Jorn suggested.
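The stats workflow suggested at the top of the thread might look like the following sketch, using Vadim's doc_t table from the original question as the example (the literal id value is taken from his query):

```sql
-- Check what plan the optimizer chooses for the query:
EXPLAIN EXTENDED
SELECT count(*) FROM my_schema_name.doc_t WHERE id = '3723445235879';

-- Gather column-level statistics so the optimizer has data to work with:
ANALYZE TABLE my_schema_name.doc_t COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the gathered statistics for the table, or a single column:
DESC FORMATTED my_schema_name.doc_t;
DESC FORMATTED my_schema_name.doc_t id;
```

If the column stats were picked up, the `DESC FORMATTED ... id` output should show distinct-count and null-count figures rather than empty fields.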
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> On 16 June 2016 at 22:02, Jörn Franke <[email protected]> wrote:
>>>
>>>> The indexes are based on the HDFS blocksize, which is usually around
>>>> 128 MB. This means that to hit a single row you must always load the
>>>> full block. In traditional databases the blocksize is much smaller, so
>>>> this is much faster there. If the optimizer does not pick up the index,
>>>> then you can query the index directly (it is just a table!). Keep in
>>>> mind that you should also use an adequate storage format for the index,
>>>> such as ORC or Parquet.
>>>>
>>>> You should not use the traditional indexes, but use Hive+Tez and the
>>>> ORC format with storage indexes and bloom filters (i.e. min. Hive 1.2).
>>>> It is of key importance that you insert the data sorted on the columns
>>>> that you use in the where clause. You should compress the table with
>>>> Snappy. Additionally, partitions make sense. Finally, please use the
>>>> right data types: storage indexes work best with ints etc.; for text
>>>> fields you can try bloom filters.
>>>>
>>>> That being said, in other relational databases too, such as Oracle
>>>> Exadata, the use of traditional indexes is discouraged for warehouse
>>>> scenarios; storage indexes and columnar formats, including compression,
>>>> will bring the most performance.
>>>>
>>>> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>>>>
>>>> Hello!
>>>>
>>>> I use Hive 1.1.0-cdh5.5.0 and try to use index support.
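A sketch of the ORC layout Jörn describes might look like the DDL below. The table, column, and partition names are illustrative assumptions, not from the thread; bloom filters need Hive 1.2+ as he notes:

```sql
-- Hypothetical DDL following Jörn's advice: ORC with Snappy compression,
-- storage (min/max) indexes, a bloom filter on the text id column,
-- and partitioning.
CREATE TABLE my_schema_name.doc_t_orc (
  id      STRING,
  payload STRING                              -- placeholder for real columns
)
PARTITIONED BY (load_date STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'SNAPPY',
  'orc.create.index'         = 'true',        -- ORC storage indexes
  'orc.bloom.filter.columns' = 'id'           -- bloom filter for the text column
);

-- Insert the data sorted on the filter column, so min/max storage indexes
-- can skip whole stripes when the WHERE clause hits id:
INSERT OVERWRITE TABLE my_schema_name.doc_t_orc PARTITION (load_date = '2016-06-16')
SELECT id, payload
FROM my_schema_name.doc_t
SORT BY id;
```

SORT BY (rather than ORDER BY) sorts within each reducer, which is usually enough for stripe-level skipping without funnelling 6 billion rows through a single reducer.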
>>>>
>>>> My index creation:
>>>>
>>>> CREATE INDEX doc_id_idx ON TABLE my_schema_name.doc_t (id) AS
>>>> 'COMPACT' WITH DEFERRED REBUILD;
>>>> ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;
>>>>
>>>> Then I set configs:
>>>>
>>>> set hive.optimize.autoindex=true;
>>>> set hive.optimize.index.filter=true;
>>>> set hive.optimize.index.filter.compact.minsize=0;
>>>> set hive.index.compact.query.max.size=-1;
>>>> set hive.index.compact.query.max.entries=-1;
>>>>
>>>> And my query is:
>>>>
>>>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>>>
>>>> Sometimes I see an improvement in performance, but in most cases I do not.
>>>>
>>>> In cases when I have an improvement:
>>>> 1. my query
>>>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>>> gives me a NullPointerException (in the logs I see that Hive doesn't
>>>> find my index table)
>>>> 2. then I write:
>>>> USE my_schema_name;
>>>> select count(*) from doc_t WHERE id = '3723445235879';
>>>> and get the result with an improvement
>>>> (172 sec)
>>>>
>>>> In cases when I don't have an improvement, I can use either
>>>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>>> without an exception, or
>>>> USE my_schema_name;
>>>> select count(*) from doc_t WHERE id = '3723445235879';
>>>> and get the result
>>>> (1153 sec)
>>>>
>>>> My table has about 6 billion rows.
>>>> I tried various combinations of index configs, including only these
>>>> two:
>>>> set hive.optimize.index.filter=true;
>>>> set hive.optimize.index.filter.compact.minsize=0;
>>>> My Hadoop version is 2.6.0-cdh5.5.0.
>>>>
>>>> What am I doing wrong?
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>> Best regards,
>>>> Vadim Dedkov.
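Jörn's point that a compact index "is just a table" can be checked directly against the index created in the question. A sketch, assuming Hive's default naming scheme for the generated index table (the exact name may differ, so it is verified first):

```sql
-- The compact index is materialized as an ordinary table, by default named
-- <db>__<table>_<index>__ (this name is an assumption; confirm it below).
USE my_schema_name;
SHOW TABLES LIKE '*doc_id_idx*';

-- A compact index table stores the indexed value plus the HDFS file name
-- and block offsets where that value occurs:
SELECT `_bucketname`, `_offsets`
FROM my_schema_name__doc_t_doc_id_idx__
WHERE id = '3723445235879';
```

If this query returns rows quickly but the original count(*) does not speed up, the index itself is fine and the problem is the optimizer not rewriting the query to use it, which matches the NullPointerException behaviour described above.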
