Jerry:

HBase snapshot is not available in 0.92.x, so you cannot use HBASE-10076 in 0.92.
FYI

On Thu, Jan 2, 2014 at 3:31 PM, Jerry Lam <[email protected]> wrote:

> Hello Sergey and Enis,
>
> Thank you for the pointer! HBASE-8691 will definitely help. HBASE-10076
> (a very interesting/exciting feature, by the way!) is what I need. How can
> I port it to 0.92.x, if that is at all possible?
>
> I understand that my test is not realistic; however, since I have only 1
> region with 1 HFile (this is by design), there should not be any
> merge-sorted read going on.
>
> One thing I'm not sure about: since I use Snappy compression, is the value
> of each KeyValue decompressed at the region server? If yes, I think that is
> quite inefficient, because the decompression could be done on the client
> side. Saving bandwidth saves a lot of time for the type of workload I'm
> working on.
>
> Best Regards,
>
> Jerry
>
>
> On Thu, Jan 2, 2014 at 5:02 PM, Enis Söztutar <[email protected]> wrote:
>
> > Nice test!
> >
> > There are a couple of things here:
> >
> > (1) HFileReader reads only one file, whereas an HRegion reads multiple
> > files (into the KeyValueHeap) to do a merge scan. So, although there is
> > only one file, there is some overhead of doing a merge-sorted read from
> > multiple files in the region. For a more realistic test, you can try to
> > do the reads using HRegion directly (instead of HFileReader). The
> > overhead is not that much, though, in my tests.
> > (2) For scanning with the client API, the results have to be serialized
> > and deserialized and sent over the network (or loopback for local). This
> > is another overhead that is not there in HFileReader.
> > (3) The HBase scanner RPC implementation is NOT streaming. The RPC works
> > like fetching batch-size (10000) records, and cannot fully saturate the
> > disk and network pipeline.
> >
> > In my tests for "MapReduce over snapshot files" (HBASE-8369), I have
> > measured a 5x difference because of layers (2) and (3). Please see my
> > slides at http://www.slideshare.net/enissoz/mapreduce-over-snapshots
> >
> > I think we can do a much better job at (3), see HBASE-8691. However,
> > there will always be "some" overhead, although it should not be 5-8x.
> >
> > As suggested above, in the meantime, you can take a look at the patch for
> > HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to see
> > whether it suits your use case.
> >
> > Enis
> >
> >
> > On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <[email protected]> wrote:
> >
> > > Er, using MR over snapshots, which reads files directly...
> > > https://issues.apache.org/jira/browse/HBASE-8369
> > > However, it was only committed to 98.
> > > There was interest in a 94 port (HBASE-10076), but it never happened...
> > >
> > >
> > > On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <[email protected]> wrote:
> > >
> > > > You might be interested in using
> > > > https://issues.apache.org/jira/browse/HBASE-8369
> > > > However, it was only committed to 98.
> > > > There was interest in a 94 port (HBASE-10076), but it never happened...
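For anyone who wants to try what Sergey and Enis are pointing at, here is a minimal sketch of a MapReduce job that scans a snapshot's files directly through the TableSnapshotInputFormat API added by HBASE-8369. It assumes 0.98+ and Hadoop 2 (as noted at the top of the thread, snapshots do not exist on 0.92.x), and the snapshot name, restore directory and mapper are placeholders rather than anything from Jerry's setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

  // Counts cells per row; stands in for whatever per-row work the real workload does.
  static class CellCountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context) {
      context.getCounter("scan", "cells").increment(result.size());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-over-snapshot");
    job.setJarByClass(SnapshotScanJob.class);

    Scan scan = new Scan();
    scan.setCacheBlocks(false);  // the block cache is irrelevant when reading snapshot files

    // "my_snapshot" and the restore directory are placeholders; the restore
    // directory must be an HDFS path the job can write to, outside the HBase root dir.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "my_snapshot", scan, CellCountMapper.class,
        NullWritable.class, NullWritable.class, job,
        true, new Path("/tmp/snapshot-restore"));

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The restore directory is where the job lays down references to the snapshot's HFiles, so the mappers read from HDFS directly and skip the region-server RPC path, which is exactly the overhead Enis describes in points (2) and (3).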
> > > >
> > > > On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <[email protected]> wrote:
> > > >
> > > >> Hello Vladimir,
> > > >>
> > > >> In my use case, I guarantee that a major compaction is executed before
> > > >> any scan happens, because the system we build is a read-only system.
> > > >> There will be no deleted cells. Additionally, I only need to read from
> > > >> a single column family, and therefore I don't need to access multiple
> > > >> HFiles.
> > > >>
> > > >> Filter conditions are nice to have, because if I can read an HFile 8x
> > > >> faster than using HBaseClient, I can do the filtering on the client
> > > >> side and still perform faster than using HBaseClient.
> > > >>
> > > >> Thank you for your input!
> > > >>
> > > >> Jerry
> > > >>
> > > >>
> > > >> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> > > >> <[email protected]> wrote:
> > > >>
> > > >> > An HBase scanner MUST guarantee the correct order of KeyValues
> > > >> > (coming from different HFiles), apply filter conditions on the
> > > >> > included column families and qualifiers, honour the time range and
> > > >> > max versions, and correctly process deleted cells.
> > > >> > A direct HFileReader does nothing from the above list.
> > > >> >
> > > >> > Best regards,
> > > >> > Vladimir Rodionov
> > > >> > Principal Platform Engineer
> > > >> > Carrier IQ, www.carrieriq.com
> > > >> > e-mail: [email protected]
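Vladimir's list maps directly onto the client Scan API: a single Scan carries the column-family restriction, time range, max versions and filter that the region server has to honour while merge-sorting KeyValues from its HFiles and suppressing deleted cells, none of which happens when an HFile is read directly. A rough sketch against the 0.92-era client API (the table name, column family, qualifier and filter value are only illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSemanticsExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");  // table name is illustrative
    try {
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("cf"));          // restrict to one column family
      scan.setTimeRange(0L, Long.MAX_VALUE);        // time-range check done server-side
      scan.setMaxVersions(1);                       // version handling done server-side
      scan.setFilter(new SingleColumnValueFilter(   // filter evaluated server-side
          Bytes.toBytes("cf"), Bytes.toBytes("q"),
          CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("x")));
      scan.setCaching(10000);                       // rows per scanner RPC, Enis's point (3)

      long cells = 0;
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // by the time a Result arrives here, the region server has merged its
          // HFiles, applied the filter, and dropped deleted cells
          cells += r.size();
        }
      } finally {
        scanner.close();
      }
      System.out.println("cells: " + cells);
    } finally {
      table.close();
    }
  }
}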
> > > >> >
> > > >> > ________________________________________
> > > >> > From: Jerry Lam [[email protected]]
> > > >> > Sent: Thursday, January 02, 2014 7:56 AM
> > > >> > To: user
> > > >> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
> > > >> >
> > > >> > Hi Tom,
> > > >> >
> > > >> > Good point. Note that I also ran the HBaseClient performance test
> > > >> > several times (as you can see from the chart). The caching should
> > > >> > also benefit the second run of the HBaseClient performance test, not
> > > >> > just the HFileReaderV2 test.
> > > >> >
> > > >> > I still don't understand what makes HBaseClient perform so poorly in
> > > >> > comparison to accessing HDFS directly. I could understand maybe a
> > > >> > factor of 2 (even that would be too much), but a factor of 8 is quite
> > > >> > unreasonable.
> > > >> >
> > > >> > Any hint?
> > > >> >
> > > >> > Jerry
> > > >> >
> > > >> >
> > > >> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <[email protected]> wrote:
> > > >> >
> > > >> > > I'm also new to HBase and am not familiar with HFileReaderV2.
> > > >> > > However, in your description you didn't mention anything about
> > > >> > > clearing the Linux OS cache between tests. That might be why you're
> > > >> > > seeing the big difference: if you ran the HBaseClient test first, it
> > > >> > > may have warmed the OS cache, and then HFileReaderV2 benefited from
> > > >> > > it. Just a guess...
> > > >> > >
> > > >> > > -- Tom
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <[email protected]> wrote:
> > > >> > >
> > > >> > > > Hello HBase users,
> > > >> > > >
> > > >> > > > I just ran a very simple performance test and would like to see
> > > >> > > > if what I experienced makes sense.
> > > >> > > >
> > > >> > > > The experiment is as follows:
> > > >> > > > - I filled an HBase region with 700MB of data (each row has
> > > >> > > > roughly 45 columns, and the size of the entire row is 20KB)
> > > >> > > > - I configured the region to hold 4GB (therefore no split occurs)
> > > >> > > > - I ran compactions after the data was loaded and made sure that
> > > >> > > > there is only 1 region in the table under test
> > > >> > > > - No other table exists in the HBase cluster, because this is a
> > > >> > > > DEV environment
> > > >> > > > - I'm using HBase 0.92.1
> > > >> > > >
> > > >> > > > The test is very basic. I use HBaseClient to scan the entire
> > > >> > > > region to retrieve all rows and all columns in the table, just
> > > >> > > > iterating over all KeyValue pairs until it is done. It took about
> > > >> > > > 1 minute 22 sec to complete. (Note that I disabled the block
> > > >> > > > cache and used a caching size of about 10000.)
> > > >> > > >
> > > >> > > > I ran another test using HFileReaderV2 to scan the entire region
> > > >> > > > and retrieve all rows and all columns, again just iterating over
> > > >> > > > all KeyValue pairs until it is done. It took 11 sec.
> > > >> > > >
> > > >> > > > The performance difference is dramatic (almost 8 times faster
> > > >> > > > using HFileReaderV2).
> > > >> > > >
> > > >> > > > I want to know why the difference is so big, or whether I simply
> > > >> > > > didn't configure HBase properly. From this experiment, HDFS can
> > > >> > > > deliver the data efficiently, so it is not the bottleneck.
> > > >> > > >
> > > >> > > > Any help is appreciated!
> > > >> > > >
> > > >> > > > Jerry
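For concreteness, the two tests described here would look roughly like the following against the 0.92 APIs. This is only a sketch of the setup Jerry describes, not his actual harness: the table name and HFile path are placeholders, and the counting stands in for whatever per-KeyValue work the real test does.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class ScanComparison {

  // Full-table scan through the client API, as in the first test.
  static long clientScan(Configuration conf, String tableName) throws IOException {
    HTable table = new HTable(conf, tableName);  // table name is a placeholder
    long kvCount = 0;
    try {
      Scan scan = new Scan();
      scan.setCacheBlocks(false);   // block cache disabled, as in the test
      scan.setCaching(10000);       // scanner caching of about 10000, as in the test
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          kvCount += r.raw().length;   // touch every KeyValue
        }
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
    return kvCount;
  }

  // Direct read of a single HFile, bypassing the region server, as in the second test.
  static long hfileScan(Configuration conf, Path hfilePath) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    HFile.Reader reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf));
    long kvCount = 0;
    try {
      HFileScanner scanner = reader.getScanner(false, false);  // no block cache, no pread
      if (scanner.seekTo()) {
        do {
          scanner.getKeyValue();   // touch every KeyValue
          kvCount++;
        } while (scanner.next());
      }
    } finally {
      reader.close();
    }
    return kvCount;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Path hfilePath = new Path(args[0]);  // path to one HFile under the region's column-family dir
    System.out.println("client scan kvs: " + clientScan(conf, "my_table"));
    System.out.println("hfile scan kvs:  " + hfileScan(conf, hfilePath));
  }
}

The first method goes through the region server (merge scan, serialization and scanner RPCs), while the second reads a single HFile straight from HDFS, which is where the gap discussed upthread comes from.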
