Hello guys,

Any updates on the issue? Anyone?! ;))

Best regards,
   Matthew Tovbin =)



On Tue, Sep 20, 2011 at 09:41, Matthew Tovbin <matt...@tovbin.com> wrote:

>  Thanks Jean and Sandy.
>
>    I have Hive 0.7.1, and according to this patch
> https://issues.apache.org/jira/browse/HIVE-1226 at least exact-match
> queries like "...where id = '12345-123'" or partial pushdown like "...where id
> like '12345%'" should work, but I haven't seen that happen.
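>
> One way to check (just a sketch, assuming the "tbl" mapping from my original
> mail below) is to EXPLAIN the two forms that HIVE-1226 is supposed to cover
> and see whether the key predicate shows up on the TableScan itself (e.g. as a
> filterExpr) rather than only in a Filter operator below it:
>
>   EXPLAIN SELECT id FROM tbl WHERE id = '12345-123';   -- exact key match
>   EXPLAIN SELECT id FROM tbl WHERE id LIKE '12345%';   -- key prefix
>
> If the predicate only appears in the Filter, the handler is still doing a
> full scan and filtering afterwards.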
>
> Matthew.
>
>
>
> On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <prat...@adobe.com> wrote:
>
>> I suffered the same letdown a little while ago.  I believe this is the
>> relevant JIRA:
>>
>> https://issues.apache.org/jira/browse/HIVE-1643
>>
>> I'd also like to see Hive be able to limit scans to particular HBase
>> version ranges, but I don't know if that's even planned.
>>
>> Sandy
>>
>> > -----Original Message-----
>> > From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-
>> > Daniel Cryans
>> > Sent: Monday, September 19, 2011 09:58
>> > To: user@hbase.apache.org
>> > Subject: Re: Hbase-Hive integration performance issues
>> >
>> > (replying to user@, dev@ in BCC)
>> >
>> > AFAIK the HBase handler doesn't have the wits to understand that you are
>> > doing a prefix scan and thus limit the scan to only the required rows.
>> > There's a bunch of optimizations like that that need to be done.
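>> >
>> > (Illustrative sketch, not something the handler does today: substr(id, 0, 5)
>> > wraps the key column in a function, so the handler has no way to recognize
>> > it as a prefix scan. Writing the same prefix as an explicit key range or a
>> > LIKE at least expresses something that a smarter handler could turn into
>> > start/stop rows:
>> >
>> >   -- instead of: ... where substr(id, 0, 5) = '12345'
>> >   SELECT id FROM tbl WHERE id >= '12345' AND id < '12346';
>> >   SELECT id FROM tbl WHERE id LIKE '12345%';
>> >
>> > Whether 0.7.x actually turns either form into a bounded HBase scan is the
>> > open question here, but the substr() form almost certainly can't be.)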
>> >
>> > I'm pretty sure Pig does the same thing, but don't take my word on it.
>> >
>> > J-D
>> >
>> > On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <matt...@tovbin.com>
>> > wrote:
>> > > Hi guys,
>> > >
>> > > I've got a table in HBase, let's say "tbl", and I would like to query
>> > > it using Hive. Therefore I mapped the table to Hive as follows:
>> > >
>> > > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>) STORED
>> > > BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> > > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
>> > > TBLPROPERTIES("hbase.table.name" = "tbl");
>> > >
>> > > Queries like: "select * from tbl", "select id from tbl", "select id,
>> > > data from tbl" are really fast.
>> > > But queries like "select id from tbl where substr(id, 0, 5) = '12345'"
>> > > or "select id from tbl where data['777'] IS NOT NULL" are incredibly
>> > > slow.
>> > >
>> > > On the contrary, when running from the HBase shell: "scan 'tbl', {
>> > > COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or "scan 'tbl', {
>> > > COLUMNS=>'data', "FILTER" =>
>> > > FilterList.new([qualifierFilter('777')])}"
>> > > it is lightning fast!
>> > >
>> > > When I looked into the mapred job generated by Hive on the JobTracker, I
>> > > discovered that "map.input.records" counts ALL the items in the HBase
>> > > table, meaning the job makes a full table scan before it even starts
>> > > any mappers!!
>> > > Moreover, I suspect it copies all the data from the HBase table to HDFS
>> > > into the mapper's tmp input folder before execution.
>> > >
>> > > So, my questions are: why does the HBase storage handler for Hive not
>> > > translate Hive queries into the appropriate HBase operations? Why does
>> > > it scan all the records and then filter them using the "where" clause?
>> > > How can this be improved? Is Pig's integration better in this case?
>> > >
>> > >
>> > > Some additional information about the tables:
>> > > Table description in Hbase:
>> > > jruby-1.6.2 :011 >   describe 'tbl'
>> > > DESCRIPTION                                                      ENABLED
>> > >  {NAME => 'users', FAMILIES => [{NAME => 'data',                 true
>> > >   BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0',
>> > >   COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
>> > >   BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>> > >
>> > > Table description in Hive:
>> > > hive> describe tbl;
>> > > OK
>> > > id      string                  from deserializer
>> > > data    map<string,string>      from deserializer
>> > > Time taken: 0.08 seconds
>> > >
>> > > Best regards,
>> > >   Matthew Tovbin =)
>> > >
>>
>
>
