Re: Indexing in hive

Ranjith Wed, 16 May 2012 19:42:29 -0700

Thanks Mark. I going to give this a shot a get back with some results.

Thanks,
Ranjith


On May 16, 2012, at 9:27 PM, Mark Grover <[email protected]> wrote:

> Hi Ranjith,
> Here are the steps for using an index in Hive 0.7.1.
> 
> 1) Create the index 
> CREATE INDEX x ON TABLE t(j) 
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
> WITH DEFERRED REBUILD;
> 
> 2) Build the index (since you specified the 'DEFERRED REBUILD' flag in the 
> create index statement
> 
> ALTER INDEX x ON t REBUILD;
> 
> 3) Use the index
> If your original query is: select a, count(*) from t where j='and' group by a;
> and you wanted to use the index on column j, do something like:
> 
> INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` 
> FROM default__t_x__ where j='and';
> (The name default__t_x__ can be found in the output of step 2. Also, 
> /tmp/indexes directory needs to exist in HDFS. You can substitute this to be 
> any pre-existing directory in HDFS)
> SET hive.index.compact.file=/tmp/indexes/x;
> SET 
> hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
> SELECT a, count(*) from t where j='and' group by a;
> 
> Since the semantics of this usage make you specify the compact file, I have 
> not been able to figure out a way to use multiple indexes in the same query. 
> In this case we are using the index on j, the column in the where clause.
> 
> I hope you now understand why indexing in Hive is a work in progress:-)
> 
> Good luck!
> Mark
> 
> 
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation 
> 
> www: oanda.com www: fxtrade.com 
> 
> ----- Original Message -----
> From: "Ranjith Raghunath" <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Sent: Wednesday, May 16, 2012 9:46:23 PM
> Subject: Re: Indexing in hive
> 
> Thanks Mark, Carl, and Ransom. I really appreciate the answers here. I am 
> using Hive 0.7.1 and currently trying to create an index to help with 
> performance associated to a particular where clause. I have not set any 
> properties as mentioned below. I can try the options you listed below. Based 
> on the feedback, I do have a few questions:
> 
> 1. It seems like if you optimize for a groupby you cannot optimize for the 
> where clause. I am reading this correctly?
> 
> 2. How do you build the index? 
> 
> 3. Does the build process create a dataset that contains the index keys along 
> with an offset value associated with the row? 
> 
> 4. And I am guessing that you need to use the columns in the same order as it 
> is defined in the index?
> 
> 
> Thanks,
> Ranjith      
> 
> ----- Original Message -----
> From: Mark Grover [mailto:[email protected]]
> Sent: Wednesday, May 16, 2012 07:52 PM
> To: [email protected] <[email protected]>
> Cc: Zhaojun (Terry) <[email protected]>
> Subject: Re: Indexing in hive
> 
> Ransom,
> From this JIRA (https://issues.apache.org/jira/browse/HIVE-1644), it looks 
> like automatic use of indexes using hive.optimize.index.filter was introduced 
> in Hive 0.8. However, Ranjith seems to be using Hive 0.7.1 which doesn't 
> support those properties.
> 
> Ranjith, you need to set the appropriate priorities before calling your query 
> to make use of indexes. Are you setting any properties? If so, what?
> I will try to dig up what those properties are, in the meanwhile.
> 
> Mark
> 
> ----- Original Message -----
> From: "Hezhiqiang (Ransom)" <[email protected]>
> To: [email protected]
> Cc: "Zhaojun (Terry)" <[email protected]>
> Sent: Wednesday, May 16, 2012 8:32:55 PM
> Subject: RE: Indexing in hive
> 
> 
> 
> 
> “ hive.optimize.index.filter ” is the conf automatically use indexes 
> 
> If u set hive.optimize.index.groupby = true. 
> 
> It will set hive.optimize.index.filter =false. 
> 
> See your configurations. 
> 
> 
> 
> And you need to build index after create index. 
> 
> 
> 
> 
> Best regards 
> 
> Ransom. 
>

Re: Indexing in hive

Reply via email to