Ok, use EXPLAIN EXTENDED on your SQL query to see if the optimizer makes a good decision.
Help the optimizer by doing a stats update at column level:

ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS;

Use DESC FORMATTED <TABLE_NAME> to see the stats.

HTH

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 16 June 2016 at 23:01, Vadim Dedkov <[email protected]> wrote:

> *without any improvement of performance
>
> On 17 June 2016 at 1:00, Vadim Dedkov <[email protected]> wrote:
>
>> Ok, thank you. I tried Hive with Tez for my index problem, without any
>> improvement of performance.
>>
>> On 17 June 2016 at 0:22, Mich Talebzadeh <[email protected]> wrote:
>>
>>> Well, I guess I have to agree to differ on this with Jorn, as before.
>>>
>>> Vadim,
>>>
>>> Please go ahead and try what Jorn suggests. Report back if you see any
>>> improvement.
>>>
>>> A couple of points, if I may:
>>>
>>> Using Hive on Tez is not going to improve the optimizer's performance.
>>> That is just the execution engine, and BTW I would rather use Hive on
>>> Spark. Both Tez and Spark will be a better fit than the usual
>>> map-reduce engine.
>>>
>>> Actually, my suggestion would be to use Hive as the storage layer only
>>> and use Spark as the query tool. In that case you don't need to worry
>>> about indexes etc. in Hive. Spark, with its DAG and in-memory computing,
>>> will do a much better job.
>>>
>>> So:
>>>
>>>    1. Use Hive with its metadata to store data on HDFS.
>>>    2. Use Spark SQL to query that data. Orders of magnitude faster.
>>>
>>> However, I am all for you trying what Jorn suggested.
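The stats workflow suggested at the top of the thread might look like the following sketch, using Vadim's doc_t table from the original question as the example (the literal id value is taken from his query):

```sql
-- Check what plan the optimizer chooses for the query:
EXPLAIN EXTENDED
SELECT count(*) FROM my_schema_name.doc_t WHERE id = '3723445235879';

-- Gather column-level statistics so the optimizer has data to work with:
ANALYZE TABLE my_schema_name.doc_t COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the gathered statistics for the table, or a single column:
DESC FORMATTED my_schema_name.doc_t;
DESC FORMATTED my_schema_name.doc_t id;
```

If the column stats were picked up, the `DESC FORMATTED ... id` output should show distinct-count and null-count figures rather than empty fields.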
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> On 16 June 2016 at 22:02, Jörn Franke <[email protected]> wrote:
>>>
>>>> The indexes are based on the HDFS blocksize, which is usually around
>>>> 128 MB. This means that to hit a single row you must always load the
>>>> full block. In traditional databases the blocksize is much smaller, so
>>>> this is much faster there. If the optimizer does not pick up the index,
>>>> then you can query the index directly (it is just a table!). Keep in
>>>> mind that you should also use an adequate storage format for the index,
>>>> such as ORC or Parquet.
>>>>
>>>> You should not use the traditional indexes, but use Hive+Tez and the
>>>> ORC format with storage indexes and bloom filters (i.e. min. Hive 1.2).
>>>> It is of key importance that you insert the data sorted on the columns
>>>> that you use in the where clause. You should compress the table with
>>>> Snappy. Additionally, partitions make sense. Finally, please use the
>>>> right data types: storage indexes work best with ints etc.; for text
>>>> fields you can try bloom filters.
>>>>
>>>> That being said, in other relational databases too, such as Oracle
>>>> Exadata, the use of traditional indexes is discouraged for warehouse
>>>> scenarios; storage indexes and columnar formats, including compression,
>>>> will bring the most performance.
>>>>
>>>> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>>>>
>>>> Hello!
>>>>
>>>> I use Hive 1.1.0-cdh5.5.0 and try to use index support.
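A sketch of the ORC layout Jörn describes might look like the DDL below. The table, column, and partition names are illustrative assumptions, not from the thread; bloom filters need Hive 1.2+ as he notes:

```sql
-- Hypothetical DDL following Jörn's advice: ORC with Snappy compression,
-- storage (min/max) indexes, a bloom filter on the text id column,
-- and partitioning.
CREATE TABLE my_schema_name.doc_t_orc (
  id      STRING,
  payload STRING                              -- placeholder for real columns
)
PARTITIONED BY (load_date STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'SNAPPY',
  'orc.create.index'         = 'true',        -- ORC storage indexes
  'orc.bloom.filter.columns' = 'id'           -- bloom filter for the text column
);

-- Insert the data sorted on the filter column, so min/max storage indexes
-- can skip whole stripes when the WHERE clause hits id:
INSERT OVERWRITE TABLE my_schema_name.doc_t_orc PARTITION (load_date = '2016-06-16')
SELECT id, payload
FROM my_schema_name.doc_t
SORT BY id;
```

SORT BY (rather than ORDER BY) sorts within each reducer, which is usually enough for stripe-level skipping without funnelling 6 billion rows through a single reducer.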
>>>>
>>>> My index creation:
>>>>
>>>> CREATE INDEX doc_id_idx ON TABLE my_schema_name.doc_t (id) AS
>>>> 'COMPACT' WITH DEFERRED REBUILD;
>>>> ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;
>>>>
>>>> Then I set configs:
>>>>
>>>> set hive.optimize.autoindex=true;
>>>> set hive.optimize.index.filter=true;
>>>> set hive.optimize.index.filter.compact.minsize=0;
>>>> set hive.index.compact.query.max.size=-1;
>>>> set hive.index.compact.query.max.entries=-1;
>>>>
>>>> And my query is:
>>>>
>>>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>>>
>>>> Sometimes I see an improvement in performance, but in most cases I do not.
>>>>
>>>> In cases when I have an improvement:
>>>> 1. my query
>>>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>>> gives me a NullPointerException (in the logs I see that Hive doesn't
>>>> find my index table)
>>>> 2. then I write:
>>>> USE my_schema_name;
>>>> select count(*) from doc_t WHERE id = '3723445235879';
>>>> and get the result with an improvement
>>>> (172 sec)
>>>>
>>>> In cases when I don't have an improvement, I can use either
>>>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>>> without an exception, or
>>>> USE my_schema_name;
>>>> select count(*) from doc_t WHERE id = '3723445235879';
>>>> and get the result
>>>> (1153 sec)
>>>>
>>>> My table has about 6 billion rows.
>>>> I tried various combinations of index configs, including only these
>>>> two:
>>>> set hive.optimize.index.filter=true;
>>>> set hive.optimize.index.filter.compact.minsize=0;
>>>> My Hadoop version is 2.6.0-cdh5.5.0.
>>>>
>>>> What am I doing wrong?
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>> Best regards,
>>>> Vadim Dedkov.
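Jörn's point that a compact index "is just a table" can be checked directly against the index created in the question. A sketch, assuming Hive's default naming scheme for the generated index table (the exact name may differ, so it is verified first):

```sql
-- The compact index is materialized as an ordinary table, by default named
-- <db>__<table>_<index>__ (this name is an assumption; confirm it below).
USE my_schema_name;
SHOW TABLES LIKE '*doc_id_idx*';

-- A compact index table stores the indexed value plus the HDFS file name
-- and block offsets where that value occurs:
SELECT `_bucketname`, `_offsets`
FROM my_schema_name__doc_t_doc_id_idx__
WHERE id = '3723445235879';
```

If this query returns rows quickly but the original count(*) does not speed up, the index itself is fine and the problem is the optimizer not rewriting the query to use it, which matches the NullPointerException behaviour described above.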
