Hello Vladimir,

In my use case, I guarantee that a major compaction runs before any scan happens, because the system we are building is read-only. There will be no deleted cells. Additionally, I only need to read from a single column family, so I don't need to access multiple HFiles.

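As a rough illustration of why those assumptions matter, the sketch below models the merge a general-purpose scanner has to perform across several sorted files, including delete markers. It is plain Java with no HBase classes: `MergeScanSketch`, `Cell`, and `mergeScan` are hypothetical names, and the model is deliberately simplified to one column per row and a single visible version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for merging sorted store files; not an HBase API. */
class MergeScanSketch {

    /** Simplified cell: row key, timestamp, value, and a delete-marker flag. */
    static final class Cell {
        final String row;
        final long timestamp;
        final String value;
        final boolean isDelete;

        Cell(String row, long timestamp, String value, boolean isDelete) {
            this.row = row;
            this.timestamp = timestamp;
            this.value = value;
            this.isDelete = isDelete;
        }
    }

    /** One sorted input file plus the cell currently at its head. */
    private static final class Source {
        final Iterator<Cell> it;
        Cell head;

        Source(Iterator<Cell> it) {
            this.it = it;
            this.head = it.next();
        }

        /** Moves to the next cell; returns false when the file is exhausted. */
        boolean advance() {
            if (!it.hasNext()) {
                return false;
            }
            head = it.next();
            return true;
        }
    }

    /**
     * Merges files that are each sorted by (row ascending, timestamp descending)
     * and returns the newest live cell per row. A delete marker that is the
     * newest entry for a row hides every older cell of that row.
     */
    static List<Cell> mergeScan(List<List<Cell>> files) {
        PriorityQueue<Source> heap = new PriorityQueue<>((a, b) -> {
            int byRow = a.head.row.compareTo(b.head.row);
            if (byRow != 0) {
                return byRow;
            }
            int byTime = Long.compare(b.head.timestamp, a.head.timestamp);
            if (byTime != 0) {
                return byTime;
            }
            // On a timestamp tie, a delete marker sorts before a put.
            return Boolean.compare(b.head.isDelete, a.head.isDelete);
        });
        for (List<Cell> file : files) {
            if (!file.isEmpty()) {
                heap.add(new Source(file.iterator()));
            }
        }
        List<Cell> out = new ArrayList<>();
        String lastRow = null;
        while (!heap.isEmpty()) {
            Source s = heap.poll();
            Cell c = s.head;
            if (!c.row.equals(lastRow)) {
                // The first (newest) entry seen for a row decides its fate.
                lastRow = c.row;
                if (!c.isDelete) {
                    out.add(c);
                }
            }
            if (s.advance()) {
                heap.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> older = Arrays.asList(
            new Cell("r1", 1L, "old", false),
            new Cell("r2", 5L, "x", false));
        List<Cell> newer = Arrays.asList(
            new Cell("r1", 2L, "new", false),
            new Cell("r2", 7L, null, true),
            new Cell("r3", 1L, "z", false));
        for (Cell c : mergeScan(Arrays.asList(older, newer))) {
            System.out.println(c.row + " -> " + c.value);
        }
    }
}
```

With a single compacted file and no delete markers, the heap only ever holds one source and the loop reduces to a sequential pass over that file, which is the situation described above.
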
Filter conditions are only nice to have: if I can read an HFile 8x faster than through HBaseClient, I can filter on the client side and still come out ahead of HBaseClient.

Thank you for your input!

Jerry

On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov <[email protected]> wrote:

> HBase scanner MUST guarantee correct order of KeyValues (coming from
> different HFiles), filter condition + filter condition on included
> column families and qualifiers, time range, max versions and correctly
> process deleted cells. Direct HFileReader does none of the above.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: Jerry Lam [[email protected]]
> Sent: Thursday, January 02, 2014 7:56 AM
> To: user
> Subject: Re: Performance between HBaseClient scan and HFileReaderV2
>
> Hi Tom,
>
> Good point. Note that I also ran the HBaseClient performance test
> several times (as you can see from the chart). The caching should also
> have benefited the second run of the HBaseClient test, not just the
> HFileReaderV2 test.
>
> I still don't understand what makes HBaseClient perform so poorly in
> comparison to accessing HDFS directly. I can understand maybe a factor
> of 2 (even that is too much), but a factor of 8 is quite unreasonable.
>
> Any hint?
>
> Jerry
>
> On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <[email protected]> wrote:
>
> > I'm also new to HBase and am not familiar with HFileReaderV2. However,
> > in your description, you didn't mention anything about clearing the
> > Linux OS cache between tests. That might be why you're seeing the big
> > difference: if you ran the HBaseClient test first, it may have warmed
> > the OS cache and then HFileReaderV2 benefited from it. Just a guess...
> >
> > -- Tom
> >
> > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <[email protected]> wrote:
> >
> > > Hello HBase users,
> > >
> > > I just ran a very simple performance test and would like to see if
> > > what I experienced makes sense.
> > >
> > > The experiment is as follows:
> > > - I filled an HBase region with 700MB of data (each row has roughly
> > >   45 columns and the size is 20KB for the entire row)
> > > - I configured the region to hold 4GB (therefore no split occurs)
> > > - I ran compactions after the data was loaded and made sure that
> > >   there is only 1 region in the table under test.
> > > - No other table exists in the HBase cluster because this is a DEV
> > >   environment
> > > - I'm using HBase 0.92.1
> > >
> > > The test is very basic. I use HBaseClient to scan the entire region
> > > to retrieve all rows and all columns in the table, just iterating
> > > over all KeyValue pairs until it is done. It took about 1 minute
> > > 22 sec to complete. (Note that I disabled the block cache and used a
> > > caching size of about 10000.)
> > >
> > > I ran another test using HFileReaderV2 to scan the entire region and
> > > retrieve all rows and all columns, just iterating over all KeyValue
> > > pairs until it is done. It took 11 sec.
> > >
> > > The performance difference is dramatic (almost 8 times faster using
> > > HFileReaderV2).
> > >
> > > I want to know why the difference is so big, or whether I didn't
> > > configure HBase properly. From this experiment, HDFS can deliver the
> > > data efficiently, so it is not the bottleneck.
> > >
> > > Any help is appreciated!
> > >
> > > Jerry
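
The client-side filtering mentioned at the top of this message could look roughly like the sketch below. It is plain Java and does not use any HBase classes: `ClientSideFilterSketch`, `Cell`, and `filter` are hypothetical names, and the half-open time range `[minTs, maxTs)` is an assumption of this sketch. It covers only the qualifier and time-range conditions from Vladimir's list, not max versions or deletes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical client-side filter over cells read straight from a file; not an HBase API. */
class ClientSideFilterSketch {

    /** Simplified cell: row key, column qualifier, timestamp, value. */
    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String row, String qualifier, long timestamp, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Keeps cells whose qualifier is in the wanted set and whose timestamp
     * lies in [minTs, maxTs). Input order is preserved, so sorted input
     * stays sorted.
     */
    static List<Cell> filter(Iterable<Cell> cells, Set<String> qualifiers, long minTs, long maxTs) {
        List<Cell> kept = new ArrayList<>();
        for (Cell c : cells) {
            boolean qualifierOk = qualifiers.contains(c.qualifier);
            boolean timeOk = c.timestamp >= minTs && c.timestamp < maxTs;
            if (qualifierOk && timeOk) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("r1", "a", 10L, "v1"),
            new Cell("r1", "b", 10L, "v2"),
            new Cell("r2", "a", 30L, "v3"));
        Set<String> wanted = new HashSet<>(Arrays.asList("a"));
        for (Cell c : filter(cells, wanted, 0L, 20L)) {
            System.out.println(c.row + "/" + c.qualifier + "@" + c.timestamp + " = " + c.value);
        }
    }
}
```

Because every cell is still read from disk before it is discarded, this only pays off while the raw read stays much faster than the scan it replaces, as in the 11 sec versus 1 minute 22 sec measurement quoted above.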
