Most of the benchmarks I've seen are about what you're seeing: 4-5x
overhead reading from HBase vs straight DFS files.

Makes sense as we have a whole extra layer involved, plus locking
overhead, etc. We can probably do some more optimization and get down
to a 2x difference, but we'll never be as fast as churning through raw
files with no locks and no extra copies.

-Todd

On Thu, Oct 13, 2011 at 10:25 AM, Jean-Daniel Cryans
<[email protected]> wrote:
> Your question is more basic than that: it's actually how much slower it is
> to read sequentially in HBase compared to HDFS. I'm not sure anyone has
> quantified that, and there are probably a bunch of factors that can influence
> it, but at least you should try to get the same level of distribution, e.g.
> since you have fewer regions than mapper slots, force-split that table once
> or twice to get more of them. The difference here is due to the fact that
> regions can get up to 256MB by default before splitting, whereas in HDFS the
> default block size is 64MB.
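[Editorial sketch, not part of the original message.] The distribution point above can be put into numbers. This is a minimal sketch that assumes one map task per region and that each forced split doubles the region count; the helper name `map_waves` is hypothetical. The slot count (11 task trackers with 4 map slots each) comes from the original post further down the thread.

```python
import math

# 11 task trackers x 4 map slots each, as described in the original post.
MAP_SLOTS = 11 * 4


def map_waves(num_tasks, slots=MAP_SLOTS):
    """Return (waves, busy_slots_in_first_wave) for a map-only scan.

    Assumes every task takes one slot and all slots are otherwise idle.
    """
    waves = math.ceil(num_tasks / slots)
    busy = min(num_tasks, slots)
    return waves, busy


# 32 regions as reported, then after splitting every region once and twice
# (assumption: one map task per region, each split doubles the count).
for regions in (32, 64, 128):
    waves, busy = map_waves(regions)
    print(f"{regions:>3} regions: {waves} wave(s), "
          f"{busy}/{MAP_SLOTS} slots busy in the first wave")
```

Under these assumptions, 32 regions leave some map slots idle for the whole job, while one round of splitting is enough to fill every slot in the first wave.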
>
> Then maybe your HBase schema isn't efficient (fat keys), but I wouldn't be
> able to tell just by what you wrote.
>
> In any case, since you have to go through an additional layer, it will
> definitely be slower to use HBase than directly reading the files.
>
> J-D
>
> On Thu, Oct 13, 2011 at 1:53 AM, Weihua JIANG <[email protected]> wrote:
>
>> After setting this argument to 1000, I get this result: hive/hbase is 4X
>> slower than hive/hdfs.
>>
>> What slowdown factor is expected for hive/hbase vs hive/hdfs?
>>
>> Thanks
>> Weihua
>>
>> 2011/10/12 Akash Ashok <[email protected]>:
>> > Hi,
>> > To set this parameter you could use
>> > "set hbase.client.scanner.caching=500;"
>> > before the execution of your hive query.
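[Editorial sketch, not part of the original message.] To see why this setting matters so much, here is a rough estimate of how many scanner round trips a full scan needs at different caching values. It assumes each scanner call returns up to `caching` rows and ignores region boundaries; the value 1 is included only as a baseline, and whether it was the effective default on this cluster is an assumption. The row count is the "about 160M rows" from the original post, and `scanner_round_trips` is a hypothetical helper.

```python
import math

# "about 160M rows", from the original post.
TOTAL_ROWS = 160_000_000


def scanner_round_trips(rows, caching):
    """Estimate client-to-region-server round trips for a full scan.

    Assumes each round trip returns up to `caching` rows.
    """
    return math.ceil(rows / caching)


# 1 is a baseline (assumed, not confirmed by the thread); 500 and 1000 are
# the values mentioned in the thread.
for caching in (1, 500, 1000):
    trips = scanner_round_trips(TOTAL_ROWS, caching)
    print(f"caching={caching:>4}: {trips:>11,} round trips")
```

The estimate covers only the number of round trips; it says nothing about per-call cost, so it illustrates the trend rather than predicting run time.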
>> >
>> > Cheers,
>> > Akash
>> >
>> > On Wed, Oct 12, 2011 at 8:34 AM, Weihua JIANG <[email protected]> wrote:
>> >
>> >> Since I am using Hive to perform the query, I don't know how to set it.
>> >> Can you tell me how to do so?
>> >>
>> >> Thanks
>> >> Weihua
>> >>
>> >> 2011/10/12 Jean-Daniel Cryans <[email protected]>:
>> >> > This is one big factor and you didn't mention configuring it:
>> >> > http://hbase.apache.org/book.html#perf.hbase.client.caching
>> >> >
>> >> > J-D
>> >> >
>> >> > On Tue, Oct 11, 2011 at 7:47 PM, Weihua JIANG <[email protected]> wrote:
>> >> >
>> >> >> Hi all,
>> >> >>
>> >> >> I have run some perf tests on Hive+HBase. The table is a normal 2D
>> >> >> table with about 160M rows (each row with 7 small columns) and 32
>> >> >> regions. There is only one column family and all regions have been
>> >> >> major compacted to one store file before test.
>> >> >>
>> >> >> On a cluster with 11 task trackers (each with 4 map slots and 1 reduce
>> >> >> slot; these servers also act as region servers), a simple SQL query in Hive
>> >> >>   select count(*) from table where column3='Y';
>> >> >> needs ~1700 seconds to finish.
>> >> >>
>> >> >> But after using a CTAS statement to create an internal table (stored as
>> >> >> sequence file), the same query needs only 43 seconds to finish.
>> >> >>
>> >> >> So Hive+HBase is 40X slower than Hive+HDFS.
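[Editorial sketch, not part of the original message.] The figures above can be checked with a few lines of arithmetic. This sketch uses only the numbers reported in this post (about 160M rows, ~1700 seconds, 43 seconds); the row count is approximate, so the throughput values are rough, and `rows_per_second` is a hypothetical helper.

```python
# Figures reported in the original post; the row count is approximate.
ROWS = 160_000_000
HBASE_SECONDS = 1700
HDFS_SECONDS = 43


def rows_per_second(rows, seconds):
    """Aggregate scan throughput across the whole cluster."""
    return rows / seconds


slowdown = HBASE_SECONDS / HDFS_SECONDS
print(f"slowdown: {slowdown:.1f}x")
print(f"Hive+HBase: {rows_per_second(ROWS, HBASE_SECONDS):,.0f} rows/s")
print(f"Hive+HDFS:  {rows_per_second(ROWS, HDFS_SECONDS):,.0f} rows/s")
```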
>> >> >>
>> >> >> Hive+HBase has fewer map tasks (32 vs 223), but since there are
>> >> >> only 44 map slots available, I don't think that is the main cause.
>> >> >>
>> >> >> I studied the source code of the HBase scan implementation. To me, it
>> >> >> seems that, in my case, the scan reads the HFile in much the same way
>> >> >> as a sequence file is read (sequential reading of each key/value pair).
>> >> >> So, in theory, the performance should be quite similar.
>> >> >>
>> >> >> Can anyone explain the 40X slowdown?
>> >> >>
>> >> >> Thanks
>> >> >> Weihua
>> >> >>
>> >> >
>> >>
>> >
>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
