Hello Sergey and Enis,

Thank you for the pointer! HBASE-8691 will definitely help. HBASE-10076
(a very interesting/exciting feature, by the way!) is what I need. How can
I port it to 0.92.x, if that is at all possible?
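For context, the way I expect to use it once ported is roughly the
following. This is a sketch against the 0.98 TableSnapshotInputFormat /
TableMapReduceUtil API from the JIRAs; I have not compiled or run it, and
the snapshot name, restore directory, and mapper are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScan {

  // The mapper sees the same (row, Result) pairs a normal scan would,
  // but the data is read directly from the snapshot's HFiles in HDFS.
  static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
      // iterate value.raw() here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-snapshot");
    job.setJarByClass(SnapshotScan.class);

    Scan scan = new Scan();
    scan.setCacheBlocks(false); // block cache is of no use for a one-off full scan

    // "mySnapshot" and the restore dir below are placeholders.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "mySnapshot", scan, CountMapper.class,
        ImmutableBytesWritable.class, Result.class, job,
        true, new Path("/tmp/snapshot-restore"));

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.waitForCompletion(true);
  }
}

The appeal for my workload is that this bypasses the region server
entirely, so none of the RPC overhead discussed below applies.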
I understand that my test is not fully realistic; however, since I have
only 1 region with 1 HFile (this is by design), there should not be any
merge-sorted read going on. One thing I am not sure about: since I use
Snappy compression, is the value of each KeyValue decompressed at the
region server? If yes, I think it is quite inefficient, because the
decompression could be done on the client side. Saving bandwidth saves a
lot of time for the type of workload I am working on.

Best Regards,

Jerry


On Thu, Jan 2, 2014 at 5:02 PM, Enis Söztutar <[email protected]> wrote:

> Nice test!
>
> There are a couple of things here:
>
> (1) HFileReader reads only one file, whereas an HRegion reads multiple
> files (into the KeyValueHeap) to do a merge scan. So, although there is
> only one file, there is some overhead in doing a merge-sorted read from
> multiple files in the region. For a more realistic test, you can try to
> do the reads using HRegion directly (instead of HFileReader); see the
> sketch at the end of this mail. The overhead is not that much in my
> tests, though.
> (2) For scanning with the client API, the results have to be serialized,
> sent over the network (or loopback for local), and deserialized. This
> overhead does not exist for HFileReader.
> (3) The HBase scanner RPC implementation is NOT streaming. The RPC works
> by fetching one batch of rows at a time (the caching size, 10000 here),
> and cannot fully saturate the disk and network pipeline.
>
> In my tests for "MapReduce over snapshot files" (HBASE-8369), I measured
> a 5x difference because of (2) and (3). Please see my slides at
> http://www.slideshare.net/enissoz/mapreduce-over-snapshots
>
> I think we can do a much better job at (3), see HBASE-8691. However,
> there will always be "some" overhead, although it should not be 5-8x.
>
> As suggested above, in the meantime, you can take a look at the patch
> for HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to
> see whether it suits your use case.
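>
> To illustrate what I mean in (1), here is a rough sketch of driving the
> merge-sorted scan through HRegion itself. This is from memory against
> the 0.92/0.94-era API and not compiled; opening the region
> (HRegion.openHRegion(...)) is left out:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.hbase.KeyValue;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.regionserver.HRegion;
> import org.apache.hadoop.hbase.regionserver.RegionScanner;
>
> public class RegionScanTest {
>   // Scans every KeyValue in an already-opened region, merge-sorting
>   // across the memstore and all HFiles, but with no RPC involved.
>   static long scanRegion(HRegion region) throws IOException {
>     Scan scan = new Scan();
>     scan.setCacheBlocks(false);     // same setting as the client test
>     RegionScanner scanner = region.getScanner(scan);
>     List<KeyValue> kvs = new ArrayList<KeyValue>();
>     long count = 0;
>     try {
>       boolean more;
>       do {
>         more = scanner.next(kvs);   // one row's KeyValues per call
>         count += kvs.size();
>         kvs.clear();
>       } while (more);
>     } finally {
>       scanner.close();
>     }
>     return count;
>   }
> }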
>
> Enis
>
>
> On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <[email protected]> wrote:
>
> > Er, using MR over snapshots, which reads files directly...
> > https://issues.apache.org/jira/browse/HBASE-8369
> > However, it was only committed to 98.
> > There was interest in a 94 port (HBASE-10076), but it never happened...
> >
> >
> > On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <[email protected]> wrote:
> >
> > > You might be interested in using
> > > https://issues.apache.org/jira/browse/HBASE-8369
> > > However, it was only committed to 98.
> > > There was interest in a 94 port (HBASE-10076), but it never happened...
> > >
> > >
> > > On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <[email protected]> wrote:
> > >
> > >> Hello Vladimir,
> > >>
> > >> In my use case, I guarantee that a major compaction is executed
> > >> before any scan happens, because the system we build is read-only.
> > >> There will be no deleted cells. Additionally, I only need to read
> > >> from a single column family, and therefore I don't need to access
> > >> multiple HFiles.
> > >>
> > >> Filter conditions are nice to have, because if I can read an HFile
> > >> 8x faster than with HBaseClient, I can do the filtering on the
> > >> client side and still perform faster than HBaseClient.
> > >>
> > >> Thank you for your input!
> > >>
> > >> Jerry
> > >>
> > >>
> > >> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> > >> <[email protected]> wrote:
> > >>
> > >> > An HBase scanner MUST guarantee the correct order of KeyValues
> > >> > (coming from different HFiles), apply filter conditions on the
> > >> > included column families and qualifiers, honor the time range and
> > >> > max versions, and correctly process deleted cells. A direct
> > >> > HFileReader does none of the above.
> > >> >
> > >> > Best regards,
> > >> > Vladimir Rodionov
> > >> > Principal Platform Engineer
> > >> > Carrier IQ, www.carrieriq.com
> > >> > e-mail: [email protected]
> > >> >
> > >> > ________________________________________
> > >> > From: Jerry Lam [[email protected]]
> > >> > Sent: Thursday, January 02, 2014 7:56 AM
> > >> > To: user
> > >> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
> > >> >
> > >> > Hi Tom,
> > >> >
> > >> > Good point. Note that I also ran the HBaseClient performance test
> > >> > several times (as you can see from the chart). The caching should
> > >> > also have benefited the second run of the HBaseClient performance
> > >> > test, not just the HFileReaderV2 test.
> > >> >
> > >> > I still don't understand what makes HBaseClient perform so poorly
> > >> > compared to accessing HDFS directly. I could understand maybe a
> > >> > factor of 2 (even that seems like too much), but a factor of 8 is
> > >> > quite unreasonable.
> > >> >
> > >> > Any hint?
> > >> >
> > >> > Jerry
> > >> >
> > >> >
> > >> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <[email protected]> wrote:
> > >> >
> > >> > > I'm also new to HBase and am not familiar with HFileReaderV2.
> > >> > > However, in your description you didn't mention anything about
> > >> > > clearing the Linux OS cache between tests. That might be why
> > >> > > you're seeing the big difference: if you ran the HBaseClient
> > >> > > test first, it may have warmed the OS cache, and HFileReaderV2
> > >> > > then benefited from it. Just a guess...
> > >> > >
> > >> > > -- Tom
> > >> > >
> > >> > >
> > >> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <[email protected]> wrote:
> > >> > >
> > >> > > > Hello HBase users,
> > >> > > >
> > >> > > > I just ran a very simple performance test and would like to
> > >> > > > see if what I experienced makes sense.
> > >> > > >
> > >> > > > The experiment is as follows:
> > >> > > > - I filled an HBase region with 700MB of data (each row has
> > >> > > > roughly 45 columns, and the entire row is about 20KB)
> > >> > > > - I configured the region to hold 4GB (therefore no split
> > >> > > > occurs)
> > >> > > > - I ran compactions after the data was loaded and made sure
> > >> > > > that there is only 1 region in the table under test
> > >> > > > - No other table exists in the HBase cluster, because this is
> > >> > > > a DEV environment
> > >> > > > - I'm using HBase 0.92.1
> > >> > > >
> > >> > > > The test is very basic. I use HBaseClient to scan the entire
> > >> > > > region, retrieving all rows and all columns in the table and
> > >> > > > just iterating over all KeyValue pairs until done. It took
> > >> > > > about 1 minute 22 sec to complete. (Note that I disable the
> > >> > > > block cache and use a caching size of about 10000.)
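> > >> > > >
> > >> > > > Trimmed down, the client side is essentially this (0.92
> > >> > > > client API; the table name is a placeholder):
> > >> > > >
> > >> > > > import org.apache.hadoop.conf.Configuration;
> > >> > > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > >> > > > import org.apache.hadoop.hbase.KeyValue;
> > >> > > > import org.apache.hadoop.hbase.client.HTable;
> > >> > > > import org.apache.hadoop.hbase.client.Result;
> > >> > > > import org.apache.hadoop.hbase.client.ResultScanner;
> > >> > > > import org.apache.hadoop.hbase.client.Scan;
> > >> > > >
> > >> > > > public class ClientScanTest {
> > >> > > >   public static void main(String[] args) throws Exception {
> > >> > > >     Configuration conf = HBaseConfiguration.create();
> > >> > > >     HTable table = new HTable(conf, "testtable");
> > >> > > >     Scan scan = new Scan();
> > >> > > >     scan.setCacheBlocks(false); // block cache disabled
> > >> > > >     scan.setCaching(10000);     // rows per RPC round trip
> > >> > > >     ResultScanner scanner = table.getScanner(scan);
> > >> > > >     long kvCount = 0;
> > >> > > >     try {
> > >> > > >       for (Result result : scanner) {
> > >> > > >         for (KeyValue kv : result.raw()) {
> > >> > > >           kvCount++; // just touch every KeyValue
> > >> > > >         }
> > >> > > >       }
> > >> > > >     } finally {
> > >> > > >       scanner.close();
> > >> > > >       table.close();
> > >> > > >     }
> > >> > > >     System.out.println("KeyValues: " + kvCount);
> > >> > > >   }
> > >> > > > }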
> > >> > > >
> > >> > > > I ran another test using HFileReaderV2 to scan the entire
> > >> > > > region and retrieve all rows and all columns, again just
> > >> > > > iterating over all KeyValue pairs until done. It took 11 sec.
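> > >> > > >
> > >> > > > The direct-read side is roughly the following sketch (0.92
> > >> > > > HFile API, trimmed; the path is a placeholder for the single
> > >> > > > file under the region's column-family directory):
> > >> > > >
> > >> > > > import org.apache.hadoop.conf.Configuration;
> > >> > > > import org.apache.hadoop.fs.FileSystem;
> > >> > > > import org.apache.hadoop.fs.Path;
> > >> > > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > >> > > > import org.apache.hadoop.hbase.KeyValue;
> > >> > > > import org.apache.hadoop.hbase.io.hfile.CacheConfig;
> > >> > > > import org.apache.hadoop.hbase.io.hfile.HFile;
> > >> > > > import org.apache.hadoop.hbase.io.hfile.HFileScanner;
> > >> > > >
> > >> > > > public class HFileScanTest {
> > >> > > >   public static void main(String[] args) throws Exception {
> > >> > > >     Configuration conf = HBaseConfiguration.create();
> > >> > > >     FileSystem fs = FileSystem.get(conf);
> > >> > > >     // placeholder path to the region's one HFile
> > >> > > >     Path path = new Path("/hbase/testtable/REGION/FAMILY/HFILE");
> > >> > > >     HFile.Reader reader =
> > >> > > >         HFile.createReader(fs, path, new CacheConfig(conf));
> > >> > > >     reader.loadFileInfo();
> > >> > > >     // no block cache, no positional read: plain streaming scan
> > >> > > >     HFileScanner scanner = reader.getScanner(false, false);
> > >> > > >     long kvCount = 0;
> > >> > > >     if (scanner.seekTo()) { // position at the first KeyValue
> > >> > > >       do {
> > >> > > >         KeyValue kv = scanner.getKeyValue();
> > >> > > >         kvCount++;
> > >> > > >       } while (scanner.next());
> > >> > > >     }
> > >> > > >     reader.close();
> > >> > > >     System.out.println("KeyValues: " + kvCount);
> > >> > > >   }
> > >> > > > }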
> > >> > > >
> > >> > > > The performance difference is dramatic (almost 8 times faster
> > >> > > > using HFileReaderV2).
> > >> > > >
> > >> > > > I want to know why the difference is so big, or whether I
> > >> > > > didn't configure HBase properly. From this experiment, HDFS
> > >> > > > can deliver the data efficiently, so it is not the bottleneck.
> > >> > > >
> > >> > > > Any help is appreciated!
> > >> > > >
> > >> > > > Jerry