Thank you for your response. Unfortunately, my query still takes the same amount of time to complete with this new setting. Would I have to do something like: "INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` FROM default__t_x__ where j='and'; (The name default__t_x__ can be found in the output of step 2. Also, /tmp/indexes directory needs to exist in HDFS. You can substitute this to be any pre-existing directory in HDFS) SET hive.index.compact.file=/tmp/indexes/x; SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;" ? I don't know how I could include this within my current query. Cheers, B Subject: Re: Hive indexing optimization From: jpullokka...@hortonworks.com To: user@hive.apache.org Date: Fri, 26 Jun 2015 01:27:21 +0000
Set hive.optimize.index.filter=true; Thanks John From: Bennie Leo <tben...@hotmail.com> Reply-To: "user@hive.apache.org" <user@hive.apache.org> Date: Thursday, June 25, 2015 at 5:48 PM To: "user@hive.apache.org" <user@hive.apache.org> Subject: Hive indexing optimization Hi, I am attempting to optimize a query using indexing. My current query converts an ipv4 address to a country using a geolocation table. However, the geolocation table is fairly large and the query takes an impractical amount of time. I have created indexes and set the binary search parameter to true (default), but the query is not faster. Here is how I set up indexing: DROPINDEXIFEXISTS ipv4indexes ON ipv4geotable; CREATEINDEX ipv4indexes ONTABLE ipv4geotable (StartIp, EndIp) AS'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITHDEFERREDREBUILD IDXPROPERTIES ('hive.index.compact.binary.search'='true'); ALTERINDEX ipv4indexes ON ipv4geotable REBUILD; And here is my query: DROPTABLEIFEXISTS ipv4table; CREATETABLE ipv4table AS SELECT logon.IP, ipv4.Country FROM (SELECT * FROM logontable WHERE isIpv4(IP)) logon LEFTOUTERJOIN (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) WHERE ipv4.StartIp <= logon.IPANDlogon.IP <= ipv4.EndIp; What the query is doing is extracting an IP from logontable and finding in which range it lies within the geolocation table (which is sorted). When a range is found, the corresponding country is returned. The problem is that Hive goes through the whole table row by row rather than performing a smart search (ex: binary search). Any suggestions on how to speed things up? Thank you, B