Thanks Mark. I going to give this a shot a get back with some results. Thanks, Ranjith
On May 16, 2012, at 9:27 PM, Mark Grover <[email protected]> wrote: > Hi Ranjith, > Here are the steps for using an index in Hive 0.7.1. > > 1) Create the index > CREATE INDEX x ON TABLE t(j) > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' > WITH DEFERRED REBUILD; > > 2) Build the index (since you specified the 'DEFERRED REBUILD' flag in the > create index statement > > ALTER INDEX x ON t REBUILD; > > 3) Use the index > If your original query is: select a, count(*) from t where j='and' group by a; > and you wanted to use the index on column j, do something like: > > INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` > FROM default__t_x__ where j='and'; > (The name default__t_x__ can be found in the output of step 2. Also, > /tmp/indexes directory needs to exist in HDFS. You can substitute this to be > any pre-existing directory in HDFS) > SET hive.index.compact.file=/tmp/indexes/x; > SET > hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat; > SELECT a, count(*) from t where j='and' group by a; > > Since the semantics of this usage make you specify the compact file, I have > not been able to figure out a way to use multiple indexes in the same query. > In this case we are using the index on j, the column in the where clause. > > I hope you now understand why indexing in Hive is a work in progress:-) > > Good luck! > Mark > > > Mark Grover, Business Intelligence Analyst > OANDA Corporation > > www: oanda.com www: fxtrade.com > > ----- Original Message ----- > From: "Ranjith Raghunath" <[email protected]> > To: "[email protected]" <[email protected]> > Cc: "[email protected]" <[email protected]> > Sent: Wednesday, May 16, 2012 9:46:23 PM > Subject: Re: Indexing in hive > > Thanks Mark, Carl, and Ransom. I really appreciate the answers here. I am > using Hive 0.7.1 and currently trying to create an index to help with > performance associated to a particular where clause. I have not set any > properties as mentioned below. I can try the options you listed below. Based > on the feedback, I do have a few questions: > > 1. It seems like if you optimize for a groupby you cannot optimize for the > where clause. I am reading this correctly? > > 2. How do you build the index? > > 3. Does the build process create a dataset that contains the index keys along > with an offset value associated with the row? > > 4. And I am guessing that you need to use the columns in the same order as it > is defined in the index? > > > Thanks, > Ranjith > > ----- Original Message ----- > From: Mark Grover [mailto:[email protected]] > Sent: Wednesday, May 16, 2012 07:52 PM > To: [email protected] <[email protected]> > Cc: Zhaojun (Terry) <[email protected]> > Subject: Re: Indexing in hive > > Ransom, > From this JIRA (https://issues.apache.org/jira/browse/HIVE-1644), it looks > like automatic use of indexes using hive.optimize.index.filter was introduced > in Hive 0.8. However, Ranjith seems to be using Hive 0.7.1 which doesn't > support those properties. > > Ranjith, you need to set the appropriate priorities before calling your query > to make use of indexes. Are you setting any properties? If so, what? > I will try to dig up what those properties are, in the meanwhile. > > Mark > > ----- Original Message ----- > From: "Hezhiqiang (Ransom)" <[email protected]> > To: [email protected] > Cc: "Zhaojun (Terry)" <[email protected]> > Sent: Wednesday, May 16, 2012 8:32:55 PM > Subject: RE: Indexing in hive > > > > > “ hive.optimize.index.filter ” is the conf automatically use indexes > > If u set hive.optimize.index.groupby = true. > > It will set hive.optimize.index.filter =false. > > See your configurations. > > > > And you need to build index after create index. > > > > > Best regards > > Ransom. >
