Re: Poor HBase map-reduce scan performance

Ted Yu Wed, 22 May 2013 15:58:06 -0700

Sandy:
Looking at patch v6 of HBASE-8420, I think it is different from your
approach below for the case of cache.size() == 0.


Maybe log a JIRA for further discussion ?

On Wed, May 22, 2013 at 3:33 PM, Sandy Pratt <[email protected]> wrote:

> It seems to be in the ballpark of what I was getting at, but I haven't
> fully digested the code yet, so I can't say for sure.
>
> Here's what I'm getting at.  Looking at
> o.a.h.h.client.ClientScanner.next() in the 0.94.2 source I have loaded, I
> see there are three branches with respect to the cache:
>
> public Result next() throws IOException {
>
>
>   // If the scanner is closed and there's nothing left in the cache,
>   // next is a no-op.
>   if (cache.size() == 0 && this.closed) {
>     return null;
>   }
>
>   if (cache.size() == 0) {
>     // Request more results from RS
>     ...
>   }
>
>   if (cache.size() > 0) {
>     return cache.poll();
>   }
>
>   ...
>   return null;
>
> }
>
>
> I think that middle branch wants to change as follows (pseudo-code):
>
> if the cache size is below a certain threshold then
>   initiate asynchronous action to refill it
>   if there is no result to return until the cache refill completes then
>     block
>   done
> done
>
> Or something along those lines.  I haven't grokked the patch well enough
> yet to tell if that's what it does.  What I think is happening in the
> 0.94.2 code I've got is that it requests nothing until the cache is empty,
> then blocks until it's non-empty.  We want to eagerly and asynchronously
> refill the cache so that we ideally never have to block.
>
>
> Sandy
>
>
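
The refill strategy in Sandy's pseudo-code above can be sketched in plain
Java roughly as follows. This is only an illustration under stated
assumptions: it is not the HBASE-8420 patch and not the real ClientScanner.
BatchSource, PrefetchingScanner and the threshold parameter are hypothetical
names invented for the sketch, and it assumes a single consumer thread with
at most one refill in flight.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the RPC that asks the RegionServer for a batch. */
interface BatchSource<T> {
    /** Returns up to maxRows results; an empty list means the scan is done. */
    List<T> fetch(int maxRows);
}

/** Illustration only: a scanner that refills its cache in the background. */
final class PrefetchingScanner<T> {
    private final BatchSource<T> source;
    private final int caching;    // rows per fetch, like scanner caching (R)
    private final int threshold;  // start a refill once the cache is this small
    private final Queue<T> cache = new ArrayDeque<>();
    private CompletableFuture<List<T>> pending; // refill in flight, or null
    private boolean exhausted;

    PrefetchingScanner(BatchSource<T> source, int caching, int threshold) {
        this.source = source;
        this.caching = caching;
        this.threshold = threshold;
    }

    /** Returns the next result, or null once the source is exhausted. */
    T next() {
        // Eagerly start a refill while there are still rows to hand out.
        if (!exhausted && pending == null && cache.size() <= threshold) {
            pending = CompletableFuture.supplyAsync(() -> source.fetch(caching));
        }
        // Block only if nothing can be returned until the refill completes.
        if (cache.isEmpty() && pending != null) {
            List<T> batch = pending.join(); // rethrows fetch failures, wrapped
            pending = null;
            if (batch.isEmpty()) {
                exhausted = true;
            } else {
                cache.addAll(batch);
            }
        }
        return cache.poll();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        int[] cursor = {0};
        // Fake "RegionServer": hands out consecutive slices of the row list.
        BatchSource<String> source = maxRows -> {
            int from = cursor[0];
            int to = Math.min(rows.size(), from + maxRows);
            cursor[0] = to;
            return rows.subList(from, to);
        };
        PrefetchingScanner<String> scanner = new PrefetchingScanner<>(source, 2, 1);
        for (String row = scanner.next(); row != null; row = scanner.next()) {
            System.out.println(row);
        }
    }
}
```

With a threshold of 0 this degenerates to the behaviour Sandy describes for
0.94.2 (request nothing until the cache is empty, then block). A larger
threshold gives the fetch more time to overlap with client-side processing.
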
> On 5/22/13 1:39 PM, "Ted Yu" <[email protected]> wrote:
>
> >Sandy:
> >Do you think the following JIRA would help with what you expect in this
> >regard ?
> >
> >HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
> >
> >Cheers
> >
> >On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <[email protected]> wrote:
> >
> >> I found this thread on search-hadoop.com just now because I've been
> >> wrestling with the same issue for a while and have as yet been unable to
> >> solve it.  However, I think I have an idea of the problem.  My theory is
> >> based on assumptions about what's going on in HBase and HDFS internally,
> >> so please correct me if I'm wrong.
> >>
> >> Briefly, I think the issue is that sequential reads from HDFS are
> >> pipelined, whereas sequential reads from HBase are not.  Therefore,
> >> sequential reads from HDFS tend to keep the IO subsystem saturated,
> >> while sequential reads from HBase allow it to idle for a relatively
> >> large proportion of time.
> >>
> >> To make this more concrete, suppose that I'm reading N bytes of data
> >> from a file in HDFS.  I issue the calls to open the file and begin to
> >> read (from an InputStream, for example).  As I'm reading byte 1 of the
> >> stream at my client, the datanode is reading byte M where 1 < M <= N
> >> from disk.  Thus, three activities tend to happen concurrently for the
> >> most part (disregarding the beginning and end of the file): 1)
> >> processing at the client; 2) streaming over the network from datanode
> >> to client; and 3) reading data from disk at the datanode.  The
> >> proportion of time these three activities overlap tends towards 100%
> >> as N -> infinity.
> >>
> >> Now suppose I read a batch of R records from HBase (where R = whatever
> >> scanner caching happens to be).  As I understand it, I issue my call to
> >> ResultScanner.next(), and this causes the RegionServer to block as if
> >> on a page fault while it loads enough HFile blocks from disk to cover
> >> the R records I (implicitly) requested.  After the blocks are loaded
> >> into the block cache on the RS, the RS returns R records to me over the
> >> network.  Then I process the R records locally.  When they are
> >> exhausted, this cycle repeats.  The notable upshot is that while the RS
> >> is faulting HFile blocks into the cache, my client is blocked.
> >> Furthermore, while my client is processing records, the RS is idle with
> >> respect to work on behalf of my client.
> >>
> >> That last point is really the killer, if I'm correct in my assumptions.
> >> It means that Scanner caching and larger block sizes work only to
> >> amortize the fixed overhead of disk IOs and RPCs -- they do nothing to
> >> keep the IO subsystems saturated during sequential reads.  What
> >> *should* happen is that the RS should treat the Scanner caching value
> >> (R above) as a hint that it should always have ready records r + 1 to
> >> r + R when I'm reading record r, at least up to the region boundary.
> >> The RS should be preparing eagerly for the next call to
> >> ResultScanner.next(), which I suspect it's currently not doing.
> >>
> >> Another way to state this would be to say that the client should tell
> >> the RS to prepare the next batch of records soon enough that they can
> >> start to arrive at the client just as the client is finishing the
> >> current batch.  As is, I suspect it doesn't request more from the RS
> >> until the local batch is exhausted.
> >>
> >> As I cautioned before, this is based on assumptions about how the
> >> internals work, so please correct me if I'm wrong.  Also, I'm way behind
> >> on the mailing list, so I probably won't see any responses unless CC'd
> >> directly.
> >>
> >> Sandy
> >>
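
To put rough numbers on the blocking cycle versus the pipelined one
described in Sandy's message above, here is a back-of-the-envelope model.
It assumes a fixed fetch time and a fixed client processing time per batch,
and one batch in flight at a time. The class name and all figures are
hypothetical and are not measurements from this thread.

```java
/** Illustrative timing model for scanning a table in batches. */
final class ScanTimingModel {
    /** Client blocks on each fetch and then processes: nothing overlaps. */
    static long serialMillis(int batches, long fetchMs, long processMs) {
        return batches * (fetchMs + processMs);
    }

    /** The next batch is fetched while the current one is being processed. */
    static long pipelinedMillis(int batches, long fetchMs, long processMs) {
        if (batches == 0) {
            return 0;
        }
        // The first fetch and the last processing step cannot overlap with
        // anything; in between, the slower of the two stages sets the pace.
        return fetchMs + (batches - 1) * Math.max(fetchMs, processMs) + processMs;
    }

    public static void main(String[] args) {
        int batches = 100;   // hypothetical number of next() round trips
        long fetchMs = 30;   // hypothetical RS + network time per batch
        long processMs = 20; // hypothetical client time per batch
        System.out.println("serial:    " + serialMillis(batches, fetchMs, processMs) + " ms");
        System.out.println("pipelined: " + pipelinedMillis(batches, fetchMs, processMs) + " ms");
    }
}
```

With these made-up inputs the serial cycle takes 5000 ms and the pipelined
one 3020 ms. When fetch and processing times are equal, the pipelined total
approaches half of the serial total as the number of batches grows, which
matches the "overlap tends towards 100%" argument.
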
> >> On 5/10/13 8:46 AM, "Bryan Keller" <[email protected]> wrote:
> >>
> >> >FYI, I ran tests with compression on and off.
> >> >
> >> >With a plain HDFS sequence file and compression off, I am getting very
> >> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> >> >compression on with a sequence file, I/O speed is about 3x slower.
> >> >However the file size is 3x smaller so it takes about the same time to
> >> >scan.
> >> >
> >> >With HBase, the results are equivalent (just much slower than a
> >> >sequence file). Scanning a compressed table is about 3x slower I/O
> >> >than an uncompressed table, but the table is 3x smaller, so the time
> >> >to scan is about the same. Scanning an HBase table takes about 3x as
> >> >long as scanning the sequence file export of the table, either
> >> >compressed or uncompressed. The sequence file export file size ends up
> >> >being just barely larger than the table, either compressed or
> >> >uncompressed.
> >> >
> >> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> >> >the time to scan is about the same. Adding in HBase slows things down
> >> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed
> >> >sequence file vs scanning a compressed table.
> >> >
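
The 3x / 3x / 9x figures in Bryan's message above follow from treating I/O
throughput as bytes read divided by scan time. The sketch below restates
that arithmetic with sizes and times expressed relative to the uncompressed
sequence file. The class and method names are hypothetical, and the factors
are the approximate ones quoted in the message.

```java
/** Restates the relative-throughput arithmetic from the message above. */
final class ScanThroughput {
    /** Throughput relative to a baseline with size 1 and scan time 1. */
    static double relative(double sizeFactor, double timeFactor) {
        return sizeFactor / timeFactor;
    }

    public static void main(String[] args) {
        double seqPlain = relative(1.0, 1.0);          // uncompressed sequence file
        double seqSnappy = relative(1.0 / 3.0, 1.0);   // 3x smaller, same scan time
        double tablePlain = relative(1.0, 3.0);        // same size, 3x the scan time
        double tableSnappy = relative(1.0 / 3.0, 3.0); // 3x smaller, 3x the scan time
        System.out.println("seq file, snappy vs plain: " + seqSnappy / seqPlain);
        System.out.println("table, plain vs seq plain: " + tablePlain / seqPlain);
        System.out.println("seq plain vs table snappy: " + seqPlain / tableSnappy);
    }
}
```

The last ratio is the 9x I/O gap, even though the wall-clock gap between
the sequence file and the table is only about 3x in either configuration.
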
> >> >
> >> >On May 8, 2013, at 10:15 AM, Bryan Keller <[email protected]> wrote:
> >> >
> >> >> Thanks for the offer Lars! I haven't made much progress speeding
> >>things
> >> >>up.
> >> >>
> >> >> I finally put together a test program that populates a table that is
> >> >>similar to my production dataset. I have a readme that should describe
> >> >>things, hopefully enough to make it useable. There is code to
> >>populate a
> >> >>test table, code to scan the table, and code to scan sequence files
> >>from
> >> >>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >> >>
> >> >> You can find the code here:
> >> >>
> >> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >> >>
> >> >>
> >> >> On May 4, 2013, at 6:33 PM, lars hofhansl <[email protected]> wrote:
> >> >>
> >> >>> The blockbuffers are not reused, but that by itself should not be a
> >> >>>problem as they are all the same size (at least I have never
> >>identified
> >> >>>that as one in my profiling sessions).
> >> >>>
> >> >>> My offer still stands to do some profiling myself if there is an
> >>easy
> >> >>>way to generate data of similar shape.
> >> >>>
> >> >>> -- Lars
> >> >>>
> >> >>>
> >> >>>
> >> >>> ________________________________
> >> >>> From: Bryan Keller <[email protected]>
> >> >>> To: [email protected]
> >> >>> Sent: Friday, May 3, 2013 3:44 AM
> >> >>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>
> >> >>>
> >> >>> Actually I'm not too confident in my results re block size, they may
> >> >>>have been related to major compaction. I'm going to rerun before
> >> >>>drawing any conclusions.
> >> >>>
> >> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <[email protected]>
> wrote:
> >> >>>
> >> >>>> I finally made some progress. I tried a very large HBase block size
> >> >>>>(16mb), and it significantly improved scan performance. I went from
> >> >>>>45-50 min to 24 min. Not great but much better. Before I had it set
> >>to
> >> >>>>128k. Scanning an equivalent sequence file takes 10 min. My random
> >> >>>>read performance will probably suffer with such a large block size
> >> >>>>(theoretically), so I probably can't keep it this big. I care about
> >> >>>>random read performance too. I've read having a block size this big
> >>is
> >> >>>>not recommended, is that correct?
> >> >>>>
> >> >>>> I haven't dug too deeply into the code, are the block buffers
> >>reused
> >> >>>>or is each new block read a new allocation? Perhaps a buffer pool
> >> >>>>could help here if there isn't one already. When doing a scan, HBase
> >> >>>>could reuse previously allocated block buffers instead of
> >>allocating a
> >> >>>>new one for each block. Then block size shouldn't affect scan
> >> >>>>performance much.
> >> >>>>
> >> >>>> I'm not using a block encoder. Also, I'm still sifting through the
> >> >>>>profiler results, I'll see if I can make more sense of it and run
> >>some
> >> >>>>more experiments.
> >> >>>>
> >> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <[email protected]>
> wrote:
> >> >>>>
> >> >>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
> >> >>>>>changed that much from 0.94.4)
> >> >>>>>
> >> >>>>>
> >> >>>>> Have you enabled one of the block encoders (FAST_DIFF, etc)? If
> >> >>>>>so, try without. They currently need to reallocate a ByteBuffer for
> >> >>>>>each single KV.
> >> >>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
> >> >>>>>have not enabled encoding, just checking).
> >> >>>>>
> >> >>>>>
> >> >>>>> And do you have a stack trace for the ByteBuffer.allocate()? That
> >> >>>>>is a strange one since it never came up in my profiling (unless you
> >> >>>>>enabled block encoding).
> >> >>>>> (You can get traces from VisualVM by creating a snapshot, but
> >> >>>>>you'd have to drill in to find the allocate()).
> >> >>>>>
> >> >>>>>
> >> >>>>> During normal scanning (again, without encoding) there should be
> >> >>>>>no allocation happening except for blocks read from disk (and they
> >> >>>>>should all be the same size, thus allocation should be cheap).
> >> >>>>>
> >> >>>>> -- Lars
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> ________________________________
> >> >>>>> From: Bryan Keller <[email protected]>
> >> >>>>> To: [email protected]
> >> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
> >> >>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>
> >> >>>>>
> >> >>>>> I ran one of my regionservers through VisualVM. It looks like the
> >> >>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
> >> >>>>>ByteBuffer.allocate(). It appears at first glance that memory
> >> >>>>>allocations may be an issue. Decompression was next below that but
> >> >>>>>less of an issue it seems.
> >> >>>>>
> >> >>>>> Would changing the block size, either HDFS or HBase, help here?
> >> >>>>>
> >> >>>>> Also, if anyone has tips on how else to profile, that would be
> >> >>>>>appreciated. VisualVM can produce a lot of noise that is hard to
> >>sift
> >> >>>>>through.
> >> >>>>>
> >> >>>>>
> >> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <[email protected]>
> >>wrote:
> >> >>>>>
> >> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
> >> >>>>>>
> >> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <[email protected]>
> >>wrote:
> >> >>>>>>
> >> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the
> >> >>>>>>>latest 0.94.7?
> >> >>>>>>> I would be very curious to see profiling data.
> >> >>>>>>>
> >> >>>>>>> -- Lars
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> ----- Original Message -----
> >> >>>>>>> From: Bryan Keller <[email protected]>
> >> >>>>>>> To: "[email protected]" <[email protected]>
> >> >>>>>>> Cc:
> >> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
> >> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>
> >> >>>>>>> I tried running my test with 0.94.4, unfortunately performance
> >>was
> >> >>>>>>>about the same. I'm planning on profiling the regionserver and
> >> >>>>>>>trying some other things tonight and tomorrow and will report
> >>back.
> >> >>>>>>>
> >> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <[email protected]>
> >> wrote:
> >> >>>>>>>
> >> >>>>>>>> Yes I would like to try this, if you can point me to the
> >>pom.xml
> >> >>>>>>>>patch that would save me some time.
> >> >>>>>>>>
> >> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
> >> >>>>>>>>amount of bytes copied around in RAM during scanning, especially
> >> >>>>>>>>if you have wide rows and/or large key portions. That in turn
> >> >>>>>>>>makes scans scale better across cores, since RAM is a shared
> >> >>>>>>>>resource between cores (much like disk).
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
> >> >>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
> >> >>>>>>>>do that.
> >> >>>>>>>>
> >> >>>>>>>> -- Lars
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> ________________________________
> >> >>>>>>>>  From: Bryan Keller <[email protected]>
> >> >>>>>>>> To: [email protected]
> >> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
> >> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> The table has hashed keys so rows are evenly distributed
> >>amongst
> >> >>>>>>>>the regionservers, and load on each regionserver is pretty much
> >> >>>>>>>>the same. I also have per-table balancing turned on. I get
> >>mostly
> >> >>>>>>>>data local mappers with only a few rack local (maybe 10 of the
> >>250
> >> >>>>>>>>mappers).
> >> >>>>>>>>
> >> >>>>>>>> Currently the table is a wide table schema, with lists of data
> >> >>>>>>>>structures stored as columns with column prefixes grouping the
> >> >>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
> >> >>>>>>>>2_address, 2_city). I was thinking of moving those data
> >>structures
> >> >>>>>>>>to protobuf which would cut down on the number of columns. The
> >> >>>>>>>>downside is I can't filter on one value with that, but it is a
> >> >>>>>>>>tradeoff I would make for performance. I was also considering
> >> >>>>>>>>restructuring the table into a tall table.
> >> >>>>>>>>
> >> >>>>>>>> Something interesting is that my old regionserver machines had
> >> >>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was
> >>about
> >> >>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
> >> >>>>>>>>neither network nor disk I/O appear to be the bottleneck. The
> >>CPU
> >> >>>>>>>>is rather high for the regionserver so it seems like the best
> >> >>>>>>>>candidate to investigate. I will try profiling it tomorrow and
> >> >>>>>>>>will report back. I may revisit compression on vs off since that
> >> >>>>>>>>is adding load to the CPU.
> >> >>>>>>>>
> >> >>>>>>>> I'll also come up with a sample program that generates data
> >> >>>>>>>>similar to my table.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <[email protected]>
> >> >>>>>>>>wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Your average row is 35k so scanner caching would not make a
> >>huge
> >> >>>>>>>>>difference, although I would have expected some improvements by
> >> >>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
> >> >>>>>>>>>
> >> >>>>>>>>> I assume your table is split sufficiently to touch all
> >> >>>>>>>>>RegionServers... Do you see the same load/IO on all region
> >> >>>>>>>>>servers?
> >> >>>>>>>>>
> >> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
> >> >>>>>>>>> I blogged about some of these changes here:
> >> >>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >> >>>>>>>>>
> >> >>>>>>>>> In your case - since you have many columns, each of which
> >> >>>>>>>>>carries the rowkey - you might benefit a lot from HBASE-7279.
> >> >>>>>>>>>
> >> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full
> >> >>>>>>>>>scans. How could it not be?
> >> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's
> >> >>>>>>>>>algorithm is disabled in both HBase and HDFS.
> >> >>>>>>>>>
> >> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe
> >>Andy
> >> >>>>>>>>>Purtell is listening, I think he did some tests with HBase on
> >> >>>>>>>>>SSDs.
> >> >>>>>>>>> With rotating media you typically see an improvement with
> >> >>>>>>>>>compression. With SSDs the added CPU needed for decompression
> >> >>>>>>>>>might outweigh the benefits.
> >> >>>>>>>>>
> >> >>>>>>>>> At the risk of starting a larger discussion here, I would
> >> >>>>>>>>>posit that HBase's LSM based design, which trades random IO for
> >> >>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
> >> >>>>>>>>>
> >> >>>>>>>>> If you can, it would be nice to run a profiler against one of
> >> >>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
> >> >>>>>>>>>see where it is bottlenecked.
> >> >>>>>>>>> (And if you send me a sample program to generate some data -
> >>not
> >> >>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
> >> >>>>>>>>>next days as my day job permits, but I do not have any machines
> >> >>>>>>>>>with SSDs).
> >> >>>>>>>>>
> >> >>>>>>>>> -- Lars
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> ________________________________
> >> >>>>>>>>> From: Bryan Keller <[email protected]>
> >> >>>>>>>>> To: [email protected]
> >> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Yes, I have tried various settings for setCaching() and I have
> >> >>>>>>>>>setCacheBlocks(false)
> >> >>>>>>>>>
> >> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <[email protected]>
> >>wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >> >>>>>>>>>>
> >> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which
> >> >>>>>>>>>>                              // will be bad for MapReduce jobs
> >> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >> >>>>>>>>>>
> >> >>>>>>>>>> I guess you have used the above setting.
> >> >>>>>>>>>>
> >> >>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
> >> >>>>>>>>>>to, say
> >> >>>>>>>>>> 0.94.7 which was recently released ?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Cheers
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
> >> >>>>>>>
> >> >>
> >> >
> >>
> >>
>
>
