(replying to user@, dev@ in BCC) AFAIK the HBase handler doesn't have the wits to understand that you are doing a prefix scan and thus limit the scan to only the required rows. There are a bunch of optimizations like that that still need to be done.
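To illustrate the kind of optimization meant here: a handler that understood the predicate could turn a row-key prefix like '12345' into scan bounds (start row = prefix, stop row = prefix with the last byte bumped), instead of scanning everything. A minimal sketch of that bound computation — a hypothetical helper, not actual HBaseStorageHandler code:

```java
import java.util.Arrays;

// Hypothetical sketch: deriving an exclusive stop row from a row-key
// prefix, so a scan covers only the rows matching the prefix.
public class PrefixScanBounds {

    // Returns the smallest row key strictly greater than every key that
    // starts with `prefix`, or null if none exists (prefix is all 0xFF,
    // in which case the scan must run to the end of the table).
    static byte[] stopRowForPrefix(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;                          // bump last non-0xFF byte
                return Arrays.copyOf(stop, i + 1);  // drop trailing bytes
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] stop = stopRowForPrefix("12345".getBytes());
        // With these bounds the handler could issue the equivalent of:
        //   scan 'tbl', { STARTROW => '12345', ENDROW => '12346' }
        System.out.println(new String(stop)); // prints "12346"
    }
}
```

That is essentially what the HBase shell scan in the question below does by hand with STARTROW/ENDROW, and what the handler currently does not do for you.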
I'm pretty sure Pig does the same thing, but don't take my word on it.

J-D

On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <matt...@tovbin.com> wrote:
> Hi guys,
>
> I've got a table in HBase, let's say "tbl", and I would like to query it
> using Hive. Therefore I mapped the table to Hive as follows:
>
> CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
> TBLPROPERTIES("hbase.table.name" = "tbl");
>
> Queries like "select * from tbl", "select id from tbl", and "select id,
> data from tbl" are really fast.
> But queries like "select id from tbl where substr(id, 0, 5) = '12345'"
> or "select id from tbl where data["777"] IS NOT NULL" are incredibly slow.
>
> On the contrary, when running from the HBase shell:
> "scan 'tbl', { COLUMNS=>'data', STARTROW='12345', ENDROW='12346' }" or
> "scan 'tbl', { COLUMNS=>'data', "FILTER" =>
> FilterList.new([qualifierFilter('777')]) }"
> it is lightning fast!
>
> When I looked into the MapReduce job generated by Hive on the JobTracker,
> I discovered that "map.input.records" counts ALL the items in the HBase
> table, meaning the job makes a full table scan before it even starts any
> mappers! Moreover, I suspect it copies all the data from the HBase table
> to the mapper's tmp input folder on HDFS before execution.
>
> So, my questions are: Why does the HBase storage handler for Hive not
> translate Hive queries into the appropriate HBase functions? Why does it
> scan all the records and then slice them using the "where" clause? How
> can this be improved? Is Pig's integration better in this case?
>
> Some additional information about the tables:
>
> Table description in HBase:
> jruby-1.6.2 :011 > describe 'tbl'
> DESCRIPTION                                                      ENABLED
> {NAME => 'users', FAMILIES => [{NAME => 'data', BLOOMFILTER =>   true
> 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION => 'LZO',
> VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>
> Table description in Hive:
> hive> describe tbl;
> OK
> id      string                  from deserializer
> data    map<string,string>      from deserializer
> Time taken: 0.08 seconds
>
> Best regards,
> Matthew Tovbin =)