Jerry:

HBase snapshot is not available in 0.92.x, so you cannot use HBASE-10076 in 0.92.
FYI

On Thu, Jan 2, 2014 at 3:31 PM, Jerry Lam <[email protected]> wrote:

> Hello Sergey and Enis,
>
> Thank you for the pointer! HBASE-8691 will definitely help. HBASE-10076
> (a very interesting/exciting feature, by the way!) is what I need. How can
> I port it to 0.92.x, if that is at all possible?
>
> I understand that my test is not realistic; however, since I have only 1
> region with 1 HFile (this is by design), there should not be any
> merge-sorted read going on.
>
> One thing I'm not sure about: since I use Snappy compression, is the value
> of each KeyValue decompressed at the region server? If yes, I think that is
> quite inefficient, because the decompression could be done on the client
> side. Saving bandwidth saves a lot of time for the type of workload I'm
> working on.
>
> Best Regards,
>
> Jerry
>
>
> On Thu, Jan 2, 2014 at 5:02 PM, Enis Söztutar <[email protected]> wrote:
>
> > Nice test!
> >
> > There are a couple of things here:
> >
> > (1) HFileReader reads only one file, whereas an HRegion reads multiple
> > files (into the KeyValueHeap) to do a merge scan. So, although there is
> > only one file, there is some overhead of doing a merge-sorted read from
> > multiple files in the region. For a more realistic test, you can try to
> > do the reads using HRegion directly (instead of HFileReader). The
> > overhead is not that much, though, in my tests.
> > (2) For scanning with the client API, the results have to be serialized
> > and deserialized and sent over the network (or loopback for local). This
> > is another overhead that is not there in HFileReader.
> > (3) The HBase scanner RPC implementation is NOT streaming. The RPC works
> > like fetching batch-size (10000) records, and cannot fully saturate the
> > disk and network pipeline.
> >
> > In my tests for "MapReduce over snapshot files" (HBASE-8369), I have
> > measured a 5x difference because of layers (2) and (3). Please see my
> > slides at http://www.slideshare.net/enissoz/mapreduce-over-snapshots
> >
> > I think we can do a much better job at (3), see HBASE-8691. However,
> > there will always be "some" overhead, although it should not be 5-8x.
> >
> > As suggested above, in the meantime, you can take a look at the patch for
> > HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to see
> > whether it suits your use case.
> >
> > Enis
> >
> >
> > On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <[email protected]> wrote:
> >
> > > Er, using MR over snapshots, which reads files directly...
> > > https://issues.apache.org/jira/browse/HBASE-8369
> > > However, it was only committed to 98.
> > > There was interest in a 94 port (HBASE-10076), but it never happened...
> > >
> > >
> > > On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <[email protected]> wrote:
> > >
> > > > You might be interested in using
> > > > https://issues.apache.org/jira/browse/HBASE-8369
> > > > However, it was only committed to 98.
> > > > There was interest in a 94 port (HBASE-10076), but it never happened...
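For anyone who wants to try what Sergey and Enis are pointing at, here is a minimal sketch of a MapReduce job that scans a snapshot's files directly through the TableSnapshotInputFormat API added by HBASE-8369. It assumes 0.98+ and Hadoop 2 (as noted at the top of the thread, snapshots do not exist on 0.92.x), and the snapshot name, restore directory and mapper are placeholders rather than anything from Jerry's setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

  // Counts cells per row; stands in for whatever per-row work the real workload does.
  static class CellCountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context) {
      context.getCounter("scan", "cells").increment(result.size());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-over-snapshot");
    job.setJarByClass(SnapshotScanJob.class);

    Scan scan = new Scan();
    scan.setCacheBlocks(false);  // the block cache is irrelevant when reading snapshot files

    // "my_snapshot" and the restore directory are placeholders; the restore
    // directory must be an HDFS path the job can write to, outside the HBase root dir.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "my_snapshot", scan, CellCountMapper.class,
        NullWritable.class, NullWritable.class, job,
        true, new Path("/tmp/snapshot-restore"));

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The restore directory is where the job lays down references to the snapshot's HFiles, so the mappers read from HDFS directly and skip the region-server RPC path, which is exactly the overhead Enis describes in points (2) and (3).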
> > > >
> > > > On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <[email protected]> wrote:
> > > >
> > > >> Hello Vladimir,
> > > >>
> > > >> In my use case, I guarantee that a major compaction is executed before
> > > >> any scan happens, because the system we build is a read-only system.
> > > >> There will be no deleted cells. Additionally, I only need to read from
> > > >> a single column family, and therefore I don't need to access multiple
> > > >> HFiles.
> > > >>
> > > >> Filter conditions are nice to have, because if I can read an HFile 8x
> > > >> faster than using HBaseClient, I can do the filtering on the client
> > > >> side and still perform faster than using HBaseClient.
> > > >>
> > > >> Thank you for your input!
> > > >>
> > > >> Jerry
> > > >>
> > > >>
> > > >> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> > > >> <[email protected]> wrote:
> > > >>
> > > >> > An HBase scanner MUST guarantee the correct order of KeyValues
> > > >> > (coming from different HFiles), apply filter conditions on the
> > > >> > included column families and qualifiers, honour the time range and
> > > >> > max versions, and correctly process deleted cells.
> > > >> > A direct HFileReader does nothing from the above list.
> > > >> >
> > > >> > Best regards,
> > > >> > Vladimir Rodionov
> > > >> > Principal Platform Engineer
> > > >> > Carrier IQ, www.carrieriq.com
> > > >> > e-mail: [email protected]
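Vladimir's list maps directly onto the client Scan API: a single Scan carries the column-family restriction, time range, max versions and filter that the region server has to honour while merge-sorting KeyValues from its HFiles and suppressing deleted cells, none of which happens when an HFile is read directly. A rough sketch against the 0.92-era client API (the table name, column family, qualifier and filter value are only illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSemanticsExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");  // table name is illustrative
    try {
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("cf"));          // restrict to one column family
      scan.setTimeRange(0L, Long.MAX_VALUE);        // time-range check done server-side
      scan.setMaxVersions(1);                       // version handling done server-side
      scan.setFilter(new SingleColumnValueFilter(   // filter evaluated server-side
          Bytes.toBytes("cf"), Bytes.toBytes("q"),
          CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("x")));
      scan.setCaching(10000);                       // rows per scanner RPC, Enis's point (3)

      long cells = 0;
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // by the time a Result arrives here, the region server has merged its
          // HFiles, applied the filter, and dropped deleted cells
          cells += r.size();
        }
      } finally {
        scanner.close();
      }
      System.out.println("cells: " + cells);
    } finally {
      table.close();
    }
  }
}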
> > > >> >
> > > >> > ________________________________________
> > > >> > From: Jerry Lam [[email protected]]
> > > >> > Sent: Thursday, January 02, 2014 7:56 AM
> > > >> > To: user
> > > >> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
> > > >> >
> > > >> > Hi Tom,
> > > >> >
> > > >> > Good point. Note that I also ran the HBaseClient performance test
> > > >> > several times (as you can see from the chart). The caching should
> > > >> > also benefit the second run of the HBaseClient performance test, not
> > > >> > just the HFileReaderV2 test.
> > > >> >
> > > >> > I still don't understand what makes HBaseClient perform so poorly in
> > > >> > comparison to accessing HDFS directly. I could understand maybe a
> > > >> > factor of 2 (even that would be too much), but a factor of 8 is quite
> > > >> > unreasonable.
> > > >> >
> > > >> > Any hint?
> > > >> >
> > > >> > Jerry
> > > >> >
> > > >> >
> > > >> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <[email protected]> wrote:
> > > >> >
> > > >> > > I'm also new to HBase and am not familiar with HFileReaderV2.
> > > >> > > However, in your description you didn't mention anything about
> > > >> > > clearing the Linux OS cache between tests. That might be why you're
> > > >> > > seeing the big difference: if you ran the HBaseClient test first, it
> > > >> > > may have warmed the OS cache, and then HFileReaderV2 benefited from
> > > >> > > it. Just a guess...
> > > >> > >
> > > >> > > -- Tom
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <[email protected]> wrote:
> > > >> > >
> > > >> > > > Hello HBase users,
> > > >> > > >
> > > >> > > > I just ran a very simple performance test and would like to see
> > > >> > > > if what I experienced makes sense.
> > > >> > > >
> > > >> > > > The experiment is as follows:
> > > >> > > > - I filled an HBase region with 700MB of data (each row has
> > > >> > > > roughly 45 columns, and the size of the entire row is 20KB)
> > > >> > > > - I configured the region to hold 4GB (therefore no split occurs)
> > > >> > > > - I ran compactions after the data was loaded and made sure that
> > > >> > > > there is only 1 region in the table under test
> > > >> > > > - No other table exists in the HBase cluster, because this is a
> > > >> > > > DEV environment
> > > >> > > > - I'm using HBase 0.92.1
> > > >> > > >
> > > >> > > > The test is very basic. I use HBaseClient to scan the entire
> > > >> > > > region to retrieve all rows and all columns in the table, just
> > > >> > > > iterating over all KeyValue pairs until it is done. It took about
> > > >> > > > 1 minute 22 sec to complete. (Note that I disabled the block
> > > >> > > > cache and used a caching size of about 10000.)
> > > >> > > >
> > > >> > > > I ran another test using HFileReaderV2 to scan the entire region
> > > >> > > > and retrieve all rows and all columns, again just iterating over
> > > >> > > > all KeyValue pairs until it is done. It took 11 sec.
> > > >> > > >
> > > >> > > > The performance difference is dramatic (almost 8 times faster
> > > >> > > > using HFileReaderV2).
> > > >> > > >
> > > >> > > > I want to know why the difference is so big, or whether I simply
> > > >> > > > didn't configure HBase properly. From this experiment, HDFS can
> > > >> > > > deliver the data efficiently, so it is not the bottleneck.
> > > >> > > >
> > > >> > > > Any help is appreciated!
> > > >> > > >
> > > >> > > > Jerry
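For concreteness, the two tests described here would look roughly like the following against the 0.92 APIs. This is only a sketch of the setup Jerry describes, not his actual harness: the table name and HFile path are placeholders, and the counting stands in for whatever per-KeyValue work the real test does.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class ScanComparison {

  // Full-table scan through the client API, as in the first test.
  static long clientScan(Configuration conf, String tableName) throws IOException {
    HTable table = new HTable(conf, tableName);  // table name is a placeholder
    long kvCount = 0;
    try {
      Scan scan = new Scan();
      scan.setCacheBlocks(false);   // block cache disabled, as in the test
      scan.setCaching(10000);       // scanner caching of about 10000, as in the test
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          kvCount += r.raw().length;   // touch every KeyValue
        }
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
    return kvCount;
  }

  // Direct read of a single HFile, bypassing the region server, as in the second test.
  static long hfileScan(Configuration conf, Path hfilePath) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    HFile.Reader reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf));
    long kvCount = 0;
    try {
      HFileScanner scanner = reader.getScanner(false, false);  // no block cache, no pread
      if (scanner.seekTo()) {
        do {
          scanner.getKeyValue();   // touch every KeyValue
          kvCount++;
        } while (scanner.next());
      }
    } finally {
      reader.close();
    }
    return kvCount;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Path hfilePath = new Path(args[0]);  // path to one HFile under the region's column-family dir
    System.out.println("client scan kvs: " + clientScan(conf, "my_table"));
    System.out.println("hfile scan kvs:  " + hfileScan(conf, hfilePath));
  }
}

The first method goes through the region server (merge scan, serialization and scanner RPCs), while the second reads a single HFile straight from HDFS, which is where the gap discussed upthread comes from.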
