Ok, thank you. I tried Hive with Tez for my index problem, without any
performance improvement.

On 17 June 2016 at 0:22, "Mich Talebzadeh" <[email protected]> wrote:
> Well I guess I have to agree to differ on this with Jorn as before.
>
> Vadim,
>
> Please go ahead and try what Jorn suggests. Report back if you see any
> improvement.
>
> A couple of points, if I may:
>
> Using Hive on Tez is not going to improve the optimiser's performance.
> That is just the execution engine, and BTW I would rather use Hive on
> Spark. Both Tez and Spark will be a better fit than the usual map-reduce
> engine.
>
> Actually my suggestion would be to use Hive as the storage layer only and
> use Spark as the query tool. In that case you don't need to worry about
> indexes etc. in Hive. Spark, with DAG and in-memory computing, will do a
> much better job.
>
> So:
>
> 1. Use Hive with its metadata to store data on HDFS.
> 2. Use Spark SQL to query that data. Orders of magnitude faster.
>
> However, I am all for you trying what Jorn suggested.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 16 June 2016 at 22:02, Jörn Franke <[email protected]> wrote:
>
>> The indexes are based on the HDFS block size, which is usually around
>> 128 MB. This means that to hit a single row you must always load the
>> full block; in traditional databases the block size is much smaller,
>> which makes point lookups much faster. If the optimizer does not pick up
>> the index, you can query the index directly (it is just a table!). Keep
>> in mind that you should also use an adequate storage format for the
>> index, such as ORC or Parquet.
>>
>> You should not use the traditional indexes, but use Hive+Tez and the ORC
>> format with storage indexes and bloom filters (i.e. min. Hive 1.2). It
>> is of key importance that you insert the data sorted on the columns that
>> you use in the WHERE clause. You should compress the table with Snappy.
>> Additionally, partitions make sense.
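A minimal HiveQL sketch of the setup Jörn describes (the table name `doc_t_orc` and the `doc` payload column are illustrative, not from the thread; only `id` and the schema name come from Vadim's query):

```sql
-- ORC table with a bloom filter on the lookup column, Snappy-compressed.
-- Min/max storage indexes are written per stripe when orc.create.index is on.
CREATE TABLE my_schema_name.doc_t_orc (
  id  STRING,
  doc STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',
  'orc.create.index' = 'true',          -- per-stripe min/max storage index
  'orc.bloom.filter.columns' = 'id'     -- bloom filter for point lookups
);

-- Insert sorted on the WHERE-clause column, so each stripe covers a narrow
-- id range and most stripes can be skipped at read time.
INSERT OVERWRITE TABLE my_schema_name.doc_t_orc
SELECT id, doc FROM my_schema_name.doc_t SORT BY id;

-- Predicate push-down must be enabled for the reader to use these indexes:
SET hive.optimize.index.filter=true;
SELECT count(*) FROM my_schema_name.doc_t_orc WHERE id = '3723445235879';
```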
>> Finally, please use the right data types: storage indexes work best with
>> ints etc.; for text fields you can try bloom filters.
>>
>> That being said, in other relational databases such as Oracle Exadata
>> the use of traditional indexes is also discouraged for warehouse
>> scenarios; storage indexes and columnar formats, including compression,
>> will bring the most performance.
>>
>> On 16 Jun 2016, at 22:50, Vadim Dedkov <[email protected]> wrote:
>>
>> Hello!
>>
>> I use Hive 1.1.0-cdh5.5.0 and try to use index support.
>>
>> My index creation:
>>
>> CREATE INDEX doc_id_idx ON TABLE my_schema_name.doc_t (id) AS 'COMPACT'
>> WITH DEFERRED REBUILD;
>> ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;
>>
>> Then I set these configs:
>>
>> set hive.optimize.autoindex=true;
>> set hive.optimize.index.filter=true;
>> set hive.optimize.index.filter.compact.minsize=0;
>> set hive.index.compact.query.max.size=-1;
>> set hive.index.compact.query.max.entries=-1;
>>
>> And my query is:
>>
>> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>
>> Sometimes I see a performance improvement, but in most cases I do not.
>>
>> In the cases where I do see an improvement:
>> 1. the query
>>    select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>>    gives me a NullPointerException (in the logs I see that Hive doesn't
>>    find my index table);
>> 2. then I write
>>    USE my_schema_name;
>>    select count(*) from doc_t WHERE id = '3723445235879';
>>    and get the result with the improvement (172 sec).
>>
>> In the cases where I don't see an improvement, I can run either
>>    select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
>> without an exception, or
>>    USE my_schema_name;
>>    select count(*) from doc_t WHERE id = '3723445235879';
>> and get the result in 1153 sec.
>>
>> My table is about 6 billion rows.
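As a side note on Jörn's remark that a compact index "is just a table": when the optimizer refuses to use it, the index table can be queried directly. A sketch, assuming Hive's generated `<db>__<table>_<index>__` naming convention (verify the real name with SHOW TABLES, since it varies by version):

```sql
-- A rebuilt COMPACT index is itself a table mapping each indexed value to
-- the HDFS files (_bucketname) and block offsets (_offsets) containing it.
USE my_schema_name;
SHOW TABLES;  -- look for something like my_schema_name__doc_t_doc_id_idx__

SELECT `_bucketname`, `_offsets`
FROM my_schema_name__doc_t_doc_id_idx__
WHERE id = '3723445235879';
```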
>> I tried various combinations of the index configs, including only these
>> two:
>>
>> set hive.optimize.index.filter=true;
>> set hive.optimize.index.filter.compact.minsize=0;
>>
>> My Hadoop version is 2.6.0-cdh5.5.0.
>>
>> What am I doing wrong?
>>
>> Thank you.
>>
>> --
>> Best regards,
>> Vadim Dedkov.
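For reference, Mich's alternative (Hive as the storage layer, Spark SQL as the query engine) needs no changes to the table or the query, since Spark reads the Hive metastore directly. A sketch, as run from the spark-sql shell (the CACHE TABLE step is optional and assumes the working set fits in executor memory):

```sql
-- Spark sees the existing Hive table through the shared metastore;
-- no data is copied.
CACHE TABLE my_schema_name.doc_t;   -- optional: pin the table in memory

SELECT count(*) FROM my_schema_name.doc_t WHERE id = '3723445235879';
```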
