Re: Poor HBase map-reduce scan performance

Sandy Pratt Wed, 22 May 2013 15:34:14 -0700

It seems to be in the ballpark of what I was getting at, but I haven't
fully digested the code yet, so I can't say for sure.


Here's what I'm getting at.  Looking at
o.a.h.h.client.ClientScanner.next() in the 0.94.2 source I have loaded, I
see there are three branches with respect to the cache:

public Result next() throws IOException {


  // If the scanner is closed and there's nothing left in the cache, next
  // is a no-op.
  if (cache.size() == 0 && this.closed) {
    return null;
  }

  if (cache.size() == 0) {
    // Request more results from RS
  ...
  }

  if (cache.size() > 0) {
    return cache.poll();
  }

  ...
  return null;

}


I think that middle branch wants to change as follows (pseudo-code):

if the cache size is below a certain threshold then
  initiate asynchronous action to refill it
  if there is no result to return until the cache refill completes then
    block
  done
done

Or something along those lines.  I haven't grokked the patch well enough
yet to tell if that's what it does.  What I think is happening in the
0.94.2 code I've got is that it requests nothing until the cache is empty,
then blocks until it's non-empty.  We want to eagerly and asynchronously
refill the cache so that we ideally never have to block.
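
To make the pseudo-code above concrete, here is a minimal Java sketch of a
cache that refills eagerly.  It is not the HBASE-8420 patch and not the real
ClientScanner: BatchSource, PrefetchingScanner and the threshold parameter
are hypothetical names made up for illustration, and fetchBatch() stands in
for whatever RPC asks the RS for more results.  It assumes a single consumer
thread and at most one refill in flight.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

/** Hypothetical stand-in for the RPC that fetches the next batch from the RS. */
interface BatchSource<T> {
    /** Returns the next batch, or an empty list once the scan is exhausted. */
    List<T> fetchBatch() throws Exception;
}

/** Sketch of a scanner that starts refilling its cache before it runs dry. */
final class PrefetchingScanner<T> {
    private final BatchSource<T> source;
    private final int threshold;                 // refill when cache size <= threshold
    private final Queue<T> cache = new ArrayDeque<>();
    private CompletableFuture<List<T>> pending;  // in-flight refill, if any
    private boolean exhausted;

    PrefetchingScanner(BatchSource<T> source, int threshold) {
        this.source = source;
        this.threshold = threshold;
    }

    /** Returns the next result, or null once the source is exhausted. */
    T next() throws InterruptedException, ExecutionException {
        // Kick off a refill as soon as the cache runs low, not when it is empty.
        if (cache.size() <= threshold && pending == null && !exhausted) {
            pending = CompletableFuture.supplyAsync(() -> {
                try {
                    return source.fetchBatch();
                } catch (Exception e) {
                    throw new CompletionException(e);
                }
            });
        }
        // Collect the refill if it has finished; wait for it only when there
        // is nothing left in the cache to hand back.
        if (pending != null && (cache.isEmpty() || pending.isDone())) {
            List<T> batch = pending.get();
            pending = null;
            exhausted = batch.isEmpty();
            cache.addAll(batch);
        }
        return cache.poll();
    }
}
```

With a threshold of 0 this degenerates to the request-when-empty behaviour
described above, just on another thread.  A threshold closer to the batch
size gives the refill more time to overlap with client-side processing.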


Sandy


On 5/22/13 1:39 PM, "Ted Yu" <[email protected]> wrote:

>Sandy:
>Do you think the following JIRA would help with what you expect in this
>regard ?
>
>HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
>
>Cheers
>
>On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <[email protected]> wrote:
>
>> I found this thread on search-hadoop.com just now because I've been
>> wrestling with the same issue for a while and have as yet been unable to
>> solve it.  However, I think I have an idea of the problem.  My theory is
>> based on assumptions about what's going on in HBase and HDFS internally,
>> so please correct me if I'm wrong.
>>
>> Briefly, I think the issue is that sequential reads from HDFS are
>> pipelined, whereas sequential reads from HBase are not.  Therefore,
>> sequential reads from HDFS tend to keep the IO subsystem saturated,
>> while sequential reads from HBase allow it to idle for a relatively
>> large proportion of time.
>>
>> To make this more concrete, suppose that I'm reading N bytes of data
>> from a file in HDFS.  I issue the calls to open the file and begin to
>> read (from an InputStream, for example).  As I'm reading byte 1 of the
>> stream at my client, the datanode is reading byte M where 1 < M <= N
>> from disk.  Thus, three activities tend to happen concurrently for the
>> most part (disregarding the beginning and end of the file): 1)
>> processing at the client; 2) streaming over the network from datanode
>> to client; and 3) reading data from disk at the datanode.  The
>> proportion of time these three activities overlap tends towards 100%
>> as N -> infinity.
>>
>> Now suppose I read a batch of R records from HBase (where R = whatever
>> scanner caching happens to be).  As I understand it, I issue my call to
>> ResultScanner.next(), and this causes the RegionServer to block as if
>> on a page fault while it loads enough HFile blocks from disk to cover
>> the R records I (implicitly) requested.  After the blocks are loaded
>> into the block cache on the RS, the RS returns R records to me over the
>> network.  Then I process the R records locally.  When they are
>> exhausted, this cycle repeats.  The notable upshot is that while the RS
>> is faulting HFile blocks into the cache, my client is blocked.
>> Furthermore, while my client is processing records, the RS is idle with
>> respect to work on behalf of my client.
>>
>> That last point is really the killer, if I'm correct in my assumptions.
>> It means that Scanner caching and larger block sizes work only to
>> amortize the fixed overhead of disk IOs and RPCs -- they do nothing to
>> keep the IO subsystems saturated during sequential reads.  What
>> *should* happen is that the RS should treat the Scanner caching value
>> (R above) as a hint that it should always have ready records r + 1 to
>> r + R when I'm reading record r, at least up to the region boundary.
>> The RS should be preparing eagerly for the next call to
>> ResultScanner.next(), which I suspect it's currently not doing.
>>
>> Another way to state this would be to say that the client should tell
>> the RS to prepare the next batch of records soon enough that they can
>> start to arrive at the client just as the client is finishing the
>> current batch.  As is, I suspect it doesn't request more from the RS
>> until the local batch is exhausted.
>>
>> As I cautioned before, this is based on assumptions about how the
>> internals work, so please correct me if I'm wrong.  Also, I'm way behind
>> on the mailing list, so I probably won't see any responses unless CC'd
>> directly.
>>
>> Sandy
>>
>> On 5/10/13 8:46 AM, "Bryan Keller" <[email protected]> wrote:
>>
>> >FYI, I ran tests with compression on and off.
>> >
>> >With a plain HDFS sequence file and compression off, I am getting very
>> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>> >compression on with a sequence file, I/O speed is about 3x slower.
>> >However the file size is 3x smaller so it takes about the same time to
>> >scan.
>> >
>> >With HBase, the results are equivalent (just much slower than a
>> >sequence file). Scanning a compressed table is about 3x slower I/O
>> >than an uncompressed table, but the table is 3x smaller, so the time
>> >to scan is about the same. Scanning an HBase table takes about 3x as
>> >long as scanning the sequence file export of the table, either
>> >compressed or uncompressed. The sequence file export file size ends up
>> >being just barely larger than the table, either compressed or
>> >uncompressed.
>> >
>> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>> >the time to scan is about the same. Adding in HBase slows things down
>> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed
>> >sequence file vs scanning a compressed table.
>> >
>> >
>> >On May 8, 2013, at 10:15 AM, Bryan Keller <[email protected]> wrote:
>> >
>> >> Thanks for the offer Lars! I haven't made much progress speeding
>> >> things up.
>> >>
>> >> I finally put together a test program that populates a table that is
>> >> similar to my production dataset. I have a readme that should describe
>> >> things, hopefully enough to make it useable. There is code to populate
>> >> a test table, code to scan the table, and code to scan sequence files
>> >> from an export (to compare HBase w/ raw HDFS). I use a gradle build
>> >> script.
>> >>
>> >> You can find the code here:
>> >>
>> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>> >>
>> >>
>> >> On May 4, 2013, at 6:33 PM, lars hofhansl <[email protected]> wrote:
>> >>
>> >>> The blockbuffers are not reused, but that by itself should not be a
>> >>> problem as they are all the same size (at least I have never
>> >>> identified that as one in my profiling sessions).
>> >>>
>> >>> My offer still stands to do some profiling myself if there is an
>> >>> easy way to generate data of similar shape.
>> >>>
>> >>> -- Lars
>> >>>
>> >>>
>> >>>
>> >>> ________________________________
>> >>> From: Bryan Keller <[email protected]>
>> >>> To: [email protected]
>> >>> Sent: Friday, May 3, 2013 3:44 AM
>> >>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>
>> >>>
>> >>> Actually I'm not too confident in my results re block size, they may
>> >>>have been related to major compaction. I'm going to rerun before
>> >>>drawing any conclusions.
>> >>>
>> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <[email protected]> wrote:
>> >>>
>> >>>> I finally made some progress. I tried a very large HBase block size
>> >>>> (16mb), and it significantly improved scan performance. I went from
>> >>>> 45-50 min to 24 min. Not great but much better. Before I had it set
>> >>>> to 128k. Scanning an equivalent sequence file takes 10 min. My random
>> >>>> read performance will probably suffer with such a large block size
>> >>>> (theoretically), so I probably can't keep it this big. I care about
>> >>>> random read performance too. I've read having a block size this big
>> >>>> is not recommended, is that correct?
>> >>>>
>> >>>> I haven't dug too deeply into the code, are the block buffers
>> >>>> reused or is each new block read a new allocation? Perhaps a buffer
>> >>>> pool could help here if there isn't one already. When doing a scan,
>> >>>> HBase could reuse previously allocated block buffers instead of
>> >>>> allocating a new one for each block. Then block size shouldn't
>> >>>> affect scan performance much.
>> >>>>
>> >>>> I'm not using a block encoder. Also, I'm still sifting through the
>> >>>> profiler results, I'll see if I can make more sense of it and run
>> >>>> some more experiments.
>> >>>>
>> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <[email protected]> wrote:
>> >>>>
>> >>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
>> >>>>>changed that much from 0.94.4)
>> >>>>>
>> >>>>>
>> >>>>> Have you enabled one of the block encoders (FAST_DIFF, etc)? If
>> >>>>> so, try without. They currently need to reallocate a ByteBuffer for
>> >>>>> each single KV.
>> >>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
>> >>>>> have not enabled encoding, just checking).
>> >>>>>
>> >>>>>
>> >>>>> And do you have a stack trace for the ByteBuffer.allocate()? That
>> >>>>> is a strange one since it never came up in my profiling (unless you
>> >>>>> enabled block encoding).
>> >>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
>> >>>>> have to drill in to find the allocate()).
>> >>>>>
>> >>>>>
>> >>>>> During normal scanning (again, without encoding) there should be
>> >>>>> no allocation happening except for blocks read from disk (and they
>> >>>>> should all be the same size, thus allocation should be cheap).
>> >>>>>
>> >>>>> -- Lars
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> ________________________________
>> >>>>> From: Bryan Keller <[email protected]>
>> >>>>> To: [email protected]
>> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
>> >>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>
>> >>>>>
>> >>>>> I ran one of my regionservers through VisualVM. It looks like the
>> >>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
>> >>>>>ByteBuffer.allocate(). It appears at first glance that memory
>> >>>>>allocations may be an issue. Decompression was next below that but
>> >>>>>less of an issue it seems.
>> >>>>>
>> >>>>> Would changing the block size, either HDFS or HBase, help here?
>> >>>>>
>> >>>>> Also, if anyone has tips on how else to profile, that would be
>> >>>>> appreciated. VisualVM can produce a lot of noise that is hard to
>> >>>>> sift through.
>> >>>>>
>> >>>>>
>> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <[email protected]> wrote:
>> >>>>>
>> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>> >>>>>>
>> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <[email protected]> wrote:
>> >>>>>>
>> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the
>> >>>>>>> latest 0.94.7?
>> >>>>>>> I would be very curious to see profiling data.
>> >>>>>>>
>> >>>>>>> -- Lars
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> ----- Original Message -----
>> >>>>>>> From: Bryan Keller <[email protected]>
>> >>>>>>> To: "[email protected]" <[email protected]>
>> >>>>>>> Cc:
>> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>>>
>> >>>>>>> I tried running my test with 0.94.4, unfortunately performance
>> >>>>>>> was about the same. I'm planning on profiling the regionserver and
>> >>>>>>> trying some other things tonight and tomorrow and will report
>> >>>>>>> back.
>> >>>>>>>
>> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> Yes I would like to try this, if you can point me to the
>> >>>>>>>> pom.xml patch that would save me some time.
>> >>>>>>>>
>> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
>> >>>>>>>> amount of bytes copied around in RAM during scanning, especially
>> >>>>>>>> if you have wide rows and/or large key portions. That in turn
>> >>>>>>>> makes scans scale better across cores, since RAM is a shared
>> >>>>>>>> resource between cores (much like disk).
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
>> >>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
>> >>>>>>>>do that.
>> >>>>>>>>
>> >>>>>>>> -- Lars
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> ________________________________
>> >>>>>>>>  From: Bryan Keller <[email protected]>
>> >>>>>>>> To: [email protected]
>> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> The table has hashed keys so rows are evenly distributed
>> >>>>>>>> amongst the regionservers, and load on each regionserver is pretty
>> >>>>>>>> much the same. I also have per-table balancing turned on. I get
>> >>>>>>>> mostly data local mappers with only a few rack local (maybe 10 of
>> >>>>>>>> the 250 mappers).
>> >>>>>>>>
>> >>>>>>>> Currently the table is a wide table schema, with lists of data
>> >>>>>>>> structures stored as columns with column prefixes grouping the
>> >>>>>>>> data structures (e.g. 1_name, 1_address, 1_city, 2_name,
>> >>>>>>>> 2_address, 2_city). I was thinking of moving those data
>> >>>>>>>> structures to protobuf which would cut down on the number of
>> >>>>>>>> columns. The downside is I can't filter on one value with that,
>> >>>>>>>> but it is a tradeoff I would make for performance. I was also
>> >>>>>>>> considering restructuring the table into a tall table.
>> >>>>>>>>
>> >>>>>>>> Something interesting is that my old regionserver machines had
>> >>>>>>>> five 15k SCSI drives instead of 2 SSDs, and performance was about
>> >>>>>>>> the same. Also, my old network was 1gbit, now it is 10gbit. So
>> >>>>>>>> neither network nor disk I/O appears to be the bottleneck. The CPU
>> >>>>>>>> is rather high for the regionserver so it seems like the best
>> >>>>>>>> candidate to investigate. I will try profiling it tomorrow and
>> >>>>>>>> will report back. I may revisit compression on vs off since that
>> >>>>>>>> is adding load to the CPU.
>> >>>>>>>>
>> >>>>>>>> I'll also come up with a sample program that generates data
>> >>>>>>>>similar to my table.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <[email protected]>
>> >>>>>>>>wrote:
>> >>>>>>>>
>> >>>>>>>>> Your average row is 35k so scanner caching would not make a
>> >>>>>>>>> huge difference, although I would have expected some improvements
>> >>>>>>>>> by setting it to 10 or 50 since you have a wide 10ge pipe.
>> >>>>>>>>>
>> >>>>>>>>> I assume your table is split sufficiently to touch all
>> >>>>>>>>> RegionServers... Do you see the same load/IO on all region
>> >>>>>>>>> servers?
>> >>>>>>>>>
>> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>> >>>>>>>>> I blogged about some of these changes here:
>> >>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>> >>>>>>>>>
>> >>>>>>>>> In your case - since you have many columns, each of which
>> >>>>>>>>> carries the rowkey - you might benefit a lot from HBASE-7279.
>> >>>>>>>>>
>> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full
>> >>>>>>>>> scans. How could it not be?
>> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's
>> >>>>>>>>> algorithm is disabled in both HBase and HDFS.
>> >>>>>>>>>
>> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe
>> >>>>>>>>> Andy Purtell is listening, I think he did some tests with HBase
>> >>>>>>>>> on SSDs.
>> >>>>>>>>> With rotating media you typically see an improvement with
>> >>>>>>>>> compression. With SSDs the added CPU needed for decompression
>> >>>>>>>>> might outweigh the benefits.
>> >>>>>>>>>
>> >>>>>>>>> At the risk of starting a larger discussion here, I would
>> >>>>>>>>> posit that HBase's LSM based design, which trades random IO for
>> >>>>>>>>> sequential IO, might be a bit more questionable on SSDs.
>> >>>>>>>>>
>> >>>>>>>>> If you can, it would be nice to run a profiler against one of
>> >>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
>> >>>>>>>>>see where it is bottlenecked.
>> >>>>>>>>> (And if you send me a sample program to generate some data -
>> >>>>>>>>> not 700g, though :) - I'll try to do a bit of profiling during
>> >>>>>>>>> the next days as my day job permits, but I do not have any
>> >>>>>>>>> machines with SSDs).
>> >>>>>>>>>
>> >>>>>>>>> -- Lars
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> ________________________________
>> >>>>>>>>> From: Bryan Keller <[email protected]>
>> >>>>>>>>> To: [email protected]
>> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Yes, I have tried various settings for setCaching() and I have
>> >>>>>>>>>setCacheBlocks(false)
>> >>>>>>>>>
>> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <[email protected]> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>> >>>>>>>>>>
>> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which
>> >>>>>>>>>>                              // will be bad for MapReduce jobs
>> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>> >>>>>>>>>>
>> >>>>>>>>>> I guess you have used the above setting.
>> >>>>>>>>>>
>> >>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
>> >>>>>>>>>>to, say
>> >>>>>>>>>> 0.94.7 which was recently released ?
>> >>>>>>>>>>
>> >>>>>>>>>> Cheers
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>> >>>>>>>
>> >>
>> >
>>
>>
