Thank you for your response. Unfortunately, my query still takes the same 
amount of time to complete with this new setting. 
 
Would I have to do something like:
 
"INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` 
FROM default__t_x__ where j='and';
(The name default__t_x__ can be found in the output of step 2. Also, 
/tmp/indexes directory needs to exist in HDFS. You can substitute this to be 
any pre-existing directory in HDFS)
SET hive.index.compact.file=/tmp/indexes/x;
SET 
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;"
 ?
 
I don't know how I could include this within my current query.
 
Cheers,
B
 
Subject: Re: Hive indexing optimization
From: jpullokka...@hortonworks.com
To: user@hive.apache.org
Date: Fri, 26 Jun 2015 01:27:21 +0000






Set hive.optimize.index.filter=true;



Thanks
John





From: Bennie Leo <tben...@hotmail.com>

Reply-To: "user@hive.apache.org" <user@hive.apache.org>

Date: Thursday, June 25, 2015 at 5:48 PM

To: "user@hive.apache.org" <user@hive.apache.org>

Subject: Hive indexing optimization










Hi,

 

I am attempting to optimize a query using indexing. My current query converts 
an ipv4 address to a country using a geolocation table. However, the 
geolocation table is fairly large and the query takes an impractical amount of 
time. I have created indexes and
 set the binary search parameter to true (default), but the query is not 
faster. 

 

Here is how I set up indexing:



DROPINDEXIFEXISTS
 ipv4indexes ON ipv4geotable;

CREATEINDEX
 ipv4indexes 

ONTABLE
 ipv4geotable (StartIp, EndIp)

AS'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'

WITHDEFERREDREBUILD

IDXPROPERTIES ('hive.index.compact.binary.search'='true');


 


ALTERINDEX
 ipv4indexes ON ipv4geotable
REBUILD;


 

And here is my query:

 

DROPTABLEIFEXISTS
 ipv4table;

CREATETABLE
 ipv4table AS

SELECT logon.IP, ipv4.Country

FROM

(SELECT
 * FROM logontable
WHERE isIpv4(IP)) logon

LEFTOUTERJOIN

(SELECT
 StartIp, EndIp, Country FROM
 ipv4geotable) ipv4 ON
 isIpv4(logon.IP) 

WHERE ipv4.StartIp <=
logon.IPANDlogon.IP
 <= ipv4.EndIp;


 

What the query is doing is extracting an IP from logontable and finding in 
which range it lies within the geolocation table (which is sorted). When a 
range is found, the corresponding country is returned. The problem is that
 Hive goes through the whole table row by row rather than performing a smart 
search (ex: binary search).


 

Any suggestions on how to speed things up? 

 

Thank you,

B






                                          

Reply via email to