*without any improvement of performance

On 17 June 2016 at 1:00, "Vadim Dedkov" <[email protected]> wrote:
> Ok, thank you. I tried Hive with Tez for my index problem without any performance
>
> On 17 June 2016 at 0:22, "Mich Talebzadeh" <[email protected]> wrote:
>
>> Well, I guess I have to agree to differ on this with Jorn as before.
>>
>> Vadim,
>>
>> Please go ahead and try what Jorn suggests. Report back if you see any improvement.
>>
>> A couple of points, if I may:
>>
>> Using Hive on Tez is not going to improve the optimiser's performance. That is just the execution engine, and BTW I would rather use Hive on Spark. Both Tez and Spark will be a better fit than the usual map-reduce engine.
>>
>> Actually, my suggestion would be to use Hive as the storage layer only and use Spark as the query tool. In that case you don't need to worry about indexes etc. in Hive. Spark, with DAG and in-memory computing, will do a much better job.
>>
>> So:
>>
>> 1. Use Hive with its metadata to store data on HDFS.
>> 2. Use Spark SQL to query that data. Orders of magnitude faster.
>>
>> However, I am all for you trying what Jorn suggested.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
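A minimal sketch of Mich's two steps (not part of the original thread), assuming the existing table my_schema_name.doc_t is already registered in the Hive metastore and that Spark is built with Hive support so it can read that metastore:

    -- Step 1 is already in place: Hive keeps the metadata, the data sits on HDFS.
    -- Step 2: issue the same query from the spark-sql shell (or a HiveContext)
    -- instead of the hive CLI; Spark plans and executes the scan itself, and no
    -- Hive index is involved.
    USE my_schema_name;
    SELECT COUNT(*) FROM doc_t WHERE id = '3723445235879';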
>> On 16 June 2016 at 22:02, Jörn Franke <[email protected]> wrote:
>>
>>> The indexes are based on the HDFS block size, which is usually around 128 MB. This means that to hit a single row you must always load the full block; in traditional databases the block size is much smaller, which is why index lookups there are much faster. If the optimizer does not pick up the index, you can query the index directly (it is just a table!). Keep in mind that you should also use an adequate storage format for the index, such as ORC or Parquet.
>>>
>>> You should not use the traditional indexes, but use Hive+Tez and the ORC format with storage indexes and bloom filters (i.e. at least Hive 1.2). It is of key importance that you insert the data sorted on the columns that you use in the WHERE clause. You should compress the table with Snappy. Additionally, partitions make sense. Finally, please use the right data types: storage indexes work best with ints etc.; for text fields you can try bloom filters.
>>>
>>> That being said, also in other relational databases such as Oracle Exadata the use of traditional indexes is discouraged for warehouse scenarios; storage indexes and columnar formats, including compression, bring the most performance.
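A minimal sketch of the layout Jörn describes (not part of the original thread), assuming Hive 1.2 or later on Tez (note that ORC bloom filters postdate Vadim's Hive 1.1.0-cdh5.5.0 mentioned below). The doc_t_orc name is invented; the source table and the id column are taken from Vadim's message:

    set hive.execution.engine=tez;

    -- Rebuild the table as ORC: Snappy compression, ORC min/max storage indexes,
    -- and a bloom filter on the column used in the WHERE clause. Writing the data
    -- sorted on id keeps each stripe's id range narrow, which is what makes the
    -- storage index selective.
    CREATE TABLE my_schema_name.doc_t_orc
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'             = 'SNAPPY',
      'orc.create.index'         = 'true',
      'orc.bloom.filter.columns' = 'id'
    )
    AS
    SELECT * FROM my_schema_name.doc_t
    SORT BY id;

    -- Predicate push-down must be on so the stripe indexes are actually used:
    set hive.optimize.index.filter=true;
    SELECT COUNT(*) FROM my_schema_name.doc_t_orc WHERE id = '3723445235879';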
>>> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>>>
>>> Hello!
>>>
>>> I use Hive 1.1.0-cdh5.5.0 and am trying to use index support.
>>>
>>> My index creation:
>>> *CREATE INDEX doc_id_idx ON TABLE my_schema_name.doc_t (id) AS 'COMPACT' WITH DEFERRED REBUILD;*
>>> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>>>
>>> Then I set these configs:
>>> *set hive.optimize.autoindex=true;*
>>> *set hive.optimize.index.filter=true;*
>>> *set hive.optimize.index.filter.compact.minsize=0;*
>>> *set hive.index.compact.query.max.size=-1;*
>>> *set hive.index.compact.query.max.entries=-1;*
>>>
>>> And my query is:
>>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>>>
>>> Sometimes I get a performance improvement, but in most cases I do not.
>>>
>>> In the cases where I do get an improvement:
>>> 1. My query
>>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>>> gives me a NullPointerException (in the logs I see that Hive does not find my index table).
>>> 2. Then I write:
>>> *USE my_schema_name;*
>>> *select count(*) from doc_t WHERE id = '3723445235879';*
>>> and get the result with the improvement (172 sec).
>>>
>>> In the cases without improvement, I can run either
>>> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>>> without an exception, or
>>> *USE my_schema_name;*
>>> *select count(*) from doc_t WHERE id = '3723445235879';*
>>> and get the result (1153 sec).
>>>
>>> My table has about 6 billion rows.
>>> I have tried various combinations of the index configs, including only these two:
>>> *set hive.optimize.index.filter=true;*
>>> *set hive.optimize.index.filter.compact.minsize=0;*
>>> My Hadoop version is 2.6.0-cdh5.5.0.
>>>
>>> What am I doing wrong?
>>>
>>> Thank you.
>>>
>>> --
>>> Best regards,
>>> Vadim Dedkov.
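A minimal sketch of Jörn's "query the index directly" tip applied to Vadim's index (not part of the original thread). Hive generates the index table's name automatically; the name used below is an assumption, so check it first:

    USE my_schema_name;

    -- The rebuilt COMPACT index is just another table; list its generated name:
    SHOW FORMATTED INDEX ON doc_t;

    -- Query that index table directly on the same predicate. It is far smaller
    -- than doc_t, and its _bucketname/_offsets columns point at the HDFS files
    -- and block offsets that contain the matching rows.
    -- The table name below is an assumed generated name; substitute the one
    -- reported by the SHOW statement above.
    SELECT `_bucketname`, `_offsets`
    FROM my_schema_name__doc_t_doc_id_idx__
    WHERE id = '3723445235879';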