Hello guys, any updates on the issue? Anyone? ;))
Best regards,
Matthew Tovbin =)

On Tue, Sep 20, 2011 at 09:41, Matthew Tovbin <matt...@tovbin.com> wrote:
> Thanks Jean and Sandy.
>
> I have Hive 0.7.1, and according to this patch
> https://issues.apache.org/jira/browse/HIVE-1226 at least exact-match
> queries like "...where id = '12345-123'" or partial pushdown like
> "...where id like '12345%'" should work, but I didn't notice it.
>
> Matthew.
>
> On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <prat...@adobe.com> wrote:
>> I suffered the same letdown a little while ago. I believe this is the
>> relevant JIRA:
>>
>> https://issues.apache.org/jira/browse/HIVE-1643
>>
>> I'd also like to see Hive be able to limit scans to particular HBase
>> version ranges, but I don't know if that's even planned.
>>
>> Sandy
>>
>>> -----Original Message-----
>>> From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans
>>> Sent: Monday, September 19, 2011 09:58
>>> To: user@hbase.apache.org
>>> Subject: Re: Hbase-Hive integration performance issues
>>>
>>> (replying to user@, dev@ in BCC)
>>>
>>> AFAIK the HBase handler doesn't have the wits to understand that you are
>>> doing a prefix scan and thus limit the scan to only the required rows.
>>> There's a bunch of optimizations like that that need to be done.
>>>
>>> I'm pretty sure Pig does the same thing, but don't take my word on it.
>>>
>>> J-D
>>>
>>> On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <matt...@tovbin.com> wrote:
>>>> Hi guys,
>>>>
>>>> I've got a table in HBase, let's say "tbl", and I would like to query it
>>>> using Hive. Therefore I mapped the table to Hive as follows:
>>>>
>>>> CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
>>>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
>>>> TBLPROPERTIES("hbase.table.name" = "tbl");
>>>>
>>>> Queries like "select * from tbl", "select id from tbl", and
>>>> "select id, data from tbl" are really fast.
>>>> But queries like "select id from tbl where substr(id, 0, 5) = '12345'"
>>>> or "select id from tbl where data['777'] IS NOT NULL" are incredibly slow.
>>>>
>>>> On the contrary, when running from the HBase shell:
>>>> "scan 'tbl', {COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or
>>>> "scan 'tbl', {COLUMNS=>'data', "FILTER" => FilterList.new([qualifierFilter('777')])}"
>>>> it is lightning fast!
>>>>
>>>> When I looked into the MapReduce job generated by Hive on the JobTracker, I
>>>> discovered that "map.input.records" counts ALL the items in the HBase
>>>> table, meaning the job makes a full table scan before it even starts any mappers!
>>>> Moreover, I suspect it copies all the data from the HBase table to HDFS, into
>>>> the mapper's tmp input folder, before execution.
>>>>
>>>> So, my questions are: Why doesn't the HBase storage handler for Hive
>>>> translate Hive queries into the appropriate HBase calls? Why does it scan
>>>> all the records and then slice them with the "where" clause? How can this
>>>> be improved? Is Pig's integration better in this case?
>>>>
>>>> Some additional information about the tables:
>>>>
>>>> Table description in HBase:
>>>> jruby-1.6.2 :011 > describe 'tbl'
>>>> DESCRIPTION                                                      ENABLED
>>>> {NAME => 'users', FAMILIES => [{NAME => 'data',                  true
>>>>  BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0',
>>>>  COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
>>>>  BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>>>>
>>>> Table description in Hive:
>>>> hive> describe tbl;
>>>> OK
>>>> id      string              from deserializer
>>>> data    map<string,string>  from deserializer
>>>> Time taken: 0.08 seconds
>>>>
>>>> Best regards,
>>>> Matthew Tovbin =)
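For reference, since J-D's point above is that the HBase storage handler doesn't yet narrow the scan for a row-key prefix predicate, here is a minimal sketch in the HBase Java client API of the kind of range-limited scan the handler would ideally issue, roughly what the shell scan with STARTROW/ENDROW does. The class name is hypothetical, the "tbl" table and "data" family are taken from the thread, and this is not the handler's actual code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical example class, not part of Hive or HBase.
    public class PrefixScanSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "tbl");
        try {
          // Bound the scan to row keys with the prefix "12345" by giving an
          // explicit start row and (exclusive) stop row, like STARTROW/ENDROW
          // in the shell example above, so only the matching rows are read.
          Scan scan = new Scan(Bytes.toBytes("12345"), Bytes.toBytes("12346"));
          scan.addFamily(Bytes.toBytes("data"));

          ResultScanner scanner = table.getScanner(scan);
          try {
            for (Result r : scanner) {
              System.out.println(Bytes.toString(r.getRow()));
            }
          } finally {
            scanner.close();
          }
        } finally {
          table.close();
        }
      }
    }

If the pushdown work referenced above (HIVE-1226, HIVE-1643) covers this case, a key predicate like "substr(id, 0, 5) = '12345'" should eventually be translated into this sort of bounded scan instead of a full scan filtered in the mappers.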