Sandy:
Do you think the following JIRA would help with what you expect in this
regard?

HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
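
For illustration only, here is a minimal sketch of the pipelining idea
described in the quoted message below: request the next batch before the
caller starts processing the current one, so fetching and processing
overlap. It is not the HBASE-8420 patch and does not use the HBase client
API. PrefetchSketch, PrefetchingBatches, rangeSource and readAll are
hypothetical names, and the in-memory source merely stands in for a
blocking call such as ResultScanner.next().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical illustration of client-side prefetching; not HBase code.
class PrefetchSketch {

    // Wraps a blocking batch source and always keeps one fetch in flight.
    // The source must return an empty list once it is exhausted.
    static final class PrefetchingBatches<T> implements Iterator<List<T>> {
        private final Supplier<List<T>> source;
        private CompletableFuture<List<T>> inFlight;

        PrefetchingBatches(Supplier<List<T>> source) {
            this.source = source;
            // Start fetching the first batch immediately, in the background.
            this.inFlight = CompletableFuture.supplyAsync(source);
        }

        @Override
        public boolean hasNext() {
            // Blocks only if the background fetch has not finished yet.
            return !inFlight.join().isEmpty();
        }

        @Override
        public List<T> next() {
            List<T> batch = inFlight.join();
            if (batch.isEmpty()) {
                throw new NoSuchElementException();
            }
            // Ask for the following batch now, before the caller processes
            // this one. Fetches never overlap each other, because a new one
            // starts only after the previous one has completed.
            inFlight = CompletableFuture.supplyAsync(source);
            return batch;
        }
    }

    // Stand-in source: yields 0..total-1 in batches of batchSize.
    // A real source would block on disk and network here.
    static Supplier<List<Integer>> rangeSource(int total, int batchSize) {
        int[] next = {0};
        return () -> {
            List<Integer> batch = new ArrayList<>();
            while (batch.size() < batchSize && next[0] < total) {
                batch.add(next[0]++);
            }
            return batch;
        };
    }

    // Drains a source through the prefetching wrapper.
    static <T> List<T> readAll(Supplier<List<T>> source) {
        List<T> out = new ArrayList<>();
        Iterator<List<T>> batches = new PrefetchingBatches<>(source);
        while (batches.hasNext()) {
            out.addAll(batches.next()); // "processing" overlaps the next fetch
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(rangeSource(10, 4)).size() + " records read");
    }
}
```

The sketch keeps exactly one batch in flight, which is the smallest amount
of read-ahead that lets the two sides overlap. A deeper queue would be a
separate design choice.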

Cheers

On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <[email protected]> wrote:

> I found this thread on search-hadoop.com just now because I've been
> wrestling with the same issue for a while and have as yet been unable to
> solve it.  However, I think I have an idea of the problem.  My theory is
> based on assumptions about what's going on in HBase and HDFS internally,
> so please correct me if I'm wrong.
>
> Briefly, I think the issue is that sequential reads from HDFS are
> pipelined, whereas sequential reads from HBase are not.  Therefore,
> sequential reads from HDFS tend to keep the IO subsystem saturated, while
> sequential reads from HBase allow it to idle for a relatively large
> proportion of time.
>
> To make this more concrete, suppose that I'm reading N bytes of data from
> a file in HDFS.  I issue the calls to open the file and begin to read
> (from an InputStream, for example).  As I'm reading byte 1 of the stream
> at my client, the datanode is reading byte M where 1 < M <= N from disk.
> Thus, three activities tend to happen concurrently for the most part
> (disregarding the beginning and end of the file): 1) processing at the
> client; 2) streaming over the network from datanode to client; and 3)
> reading data from disk at the datanode.  The proportion of time these
> three activities overlap tends towards 100% as N -> infinity.
>
> Now suppose I read a batch of R records from HBase (where R = whatever
> scanner caching happens to be).  As I understand it, I issue my call to
> ResultScanner.next(), and this causes the RegionServer to block as if on a
> page fault while it loads enough HFile blocks from disk to cover the R
> records I (implicitly) requested.  After the blocks are loaded into the
> block cache on the RS, the RS returns R records to me over the network.
> Then I process the R records locally.  When they are exhausted, this cycle
> repeats.  The notable upshot is that while the RS is faulting HFile blocks
> into the cache, my client is blocked.  Furthermore, while my client is
> processing records, the RS is idle with respect to work on behalf of my
> client.
>
> That last point is really the killer, if I'm correct in my assumptions.
> It means that Scanner caching and larger block sizes work only to amortize
> the fixed overhead of disk IOs and RPCs -- they do nothing to keep the IO
> subsystems saturated during sequential reads.  What *should* happen is
> that the RS should treat the Scanner caching value (R above) as a hint
> that it should always have ready records r + 1 to r + R when I'm reading
> record r, at least up to the region boundary.  The RS should be preparing
> eagerly for the next call to ResultScanner.next(), which I suspect it's
> currently not doing.
>
> Another way to state this would be to say that the client should tell the
> RS to prepare the next batch of records soon enough that they can start to
> arrive at the client just as the client is finishing the current batch.
> As is, I suspect it doesn't request more from the RS until the local batch
> is exhausted.
>
> As I cautioned before, this is based on assumptions about how the
> internals work, so please correct me if I'm wrong.  Also, I'm way behind
> on the mailing list, so I probably won't see any responses unless CC'd
> directly.
>
> Sandy
>
> On 5/10/13 8:46 AM, "Bryan Keller" <[email protected]> wrote:
>
> >FYI, I ran tests with compression on and off.
> >
> >With a plain HDFS sequence file and compression off, I am getting very
> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> >compression on with a sequence file, I/O speed is about 3x slower.
> >However the file size is 3x smaller so it takes about the same time to
> >scan.
> >
> >With HBase, the results are equivalent (just much slower than a sequence
> >file). Scanning a compressed table is about 3x slower I/O than an
> >uncompressed table, but the table is 3x smaller, so the time to scan is
> >about the same. Scanning an HBase table takes about 3x as long as
> >scanning the sequence file export of the table, either compressed or
> >uncompressed. The sequence file export file size ends up being just
> >barely larger than the table, either compressed or uncompressed.
> >
> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> >the time to scan is about the same. Adding in HBase slows things down
> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
> >file vs scanning a compressed table.
> >
> >
> >On May 8, 2013, at 10:15 AM, Bryan Keller <[email protected]> wrote:
> >
> >> Thanks for the offer Lars! I haven't made much progress speeding things
> >>up.
> >>
> >> I finally put together a test program that populates a table that is
> >>similar to my production dataset. I have a readme that should describe
> >>things, hopefully enough to make it useable. There is code to populate a
> >>test table, code to scan the table, and code to scan sequence files from
> >>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >>
> >> You can find the code here:
> >>
> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >>
> >>
> >> On May 4, 2013, at 6:33 PM, lars hofhansl <[email protected]> wrote:
> >>
> >>> The blockbuffers are not reused, but that by itself should not be a
> >>>problem as they are all the same size (at least I have never identified
> >>>that as one in my profiling sessions).
> >>>
> >>> My offer still stands to do some profiling myself if there is an easy
> >>>way to generate data of similar shape.
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Bryan Keller <[email protected]>
> >>> To: [email protected]
> >>> Sent: Friday, May 3, 2013 3:44 AM
> >>> Subject: Re: Poor HBase map-reduce scan performance
> >>>
> >>>
> >>> Actually I'm not too confident in my results re block size, they may
> >>>have been related to major compaction. I'm going to rerun before
> >>>drawing any conclusions.
> >>>
> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <[email protected]> wrote:
> >>>
> >>>> I finally made some progress. I tried a very large HBase block size
> >>>>(16mb), and it significantly improved scan performance. I went from
> >>>>45-50 min to 24 min. Not great but much better. Before I had it set to
> >>>>128k. Scanning an equivalent sequence file takes 10 min. My random
> >>>>read performance will probably suffer with such a large block size
> >>>>(theoretically), so I probably can't keep it this big. I care about
> >>>>random read performance too. I've read having a block size this big is
> >>>>not recommended, is that correct?
> >>>>
> >>>> I haven't dug too deeply into the code, are the block buffers reused
> >>>>or is each new block read a new allocation? Perhaps a buffer pool
> >>>>could help here if there isn't one already. When doing a scan, HBase
> >>>>could reuse previously allocated block buffers instead of allocating a
> >>>>new one for each block. Then block size shouldn't affect scan
> >>>>performance much.
> >>>>
> >>>> I'm not using a block encoder. Also, I'm still sifting through the
> >>>>profiler results, I'll see if I can make more sense of it and run some
> >>>>more experiments.
> >>>>
> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <[email protected]> wrote:
> >>>>
> >>>>> Interesting. If you can, try 0.94.7 (but it'll probably not have
> >>>>>changed that much from 0.94.4).
> >>>>>
> >>>>>
> >>>>> Have you enabled one of the block encoders (FAST_DIFF, etc.)? If
> >>>>>so, try without. They currently need to reallocate a ByteBuffer for
> >>>>>each single KV.
> >>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
> >>>>>have not enabled encoding, just checking).
> >>>>>
> >>>>>
> >>>>> And do you have a stack trace for the ByteBuffer.allocate()? That is
> >>>>>a strange one since it never came up in my profiling (unless you
> >>>>>enabled block encoding).
> >>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
> >>>>>have to drill in to find the allocate()).
> >>>>>
> >>>>>
> >>>>> During normal scanning (again, without encoding) there should be no
> >>>>>allocation happening except for blocks read from disk (and they
> >>>>>should all be the same size, thus allocation should be cheap).
> >>>>>
> >>>>> -- Lars
> >>>>>
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>> From: Bryan Keller <[email protected]>
> >>>>> To: [email protected]
> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
> >>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>
> >>>>>
> >>>>> I ran one of my regionservers through VisualVM. It looks like the
> >>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
> >>>>>ByteBuffer.allocate(). It appears at first glance that memory
> >>>>>allocations may be an issue. Decompression was next below that but
> >>>>>less of an issue it seems.
> >>>>>
> >>>>> Would changing the block size, either HDFS or HBase, help here?
> >>>>>
> >>>>> Also, if anyone has tips on how else to profile, that would be
> >>>>>appreciated. VisualVM can produce a lot of noise that is hard to sift
> >>>>>through.
> >>>>>
> >>>>>
> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <[email protected]> wrote:
> >>>>>
> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
> >>>>>>
> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
> >>>>>>>0.94.7?
> >>>>>>> I would be very curious to see profiling data.
> >>>>>>>
> >>>>>>> -- Lars
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>> From: Bryan Keller <[email protected]>
> >>>>>>> To: "[email protected]" <[email protected]>
> >>>>>>> Cc:
> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>
> >>>>>>> I tried running my test with 0.94.4, unfortunately performance was
> >>>>>>>about the same. I'm planning on profiling the regionserver and
> >>>>>>>trying some other things tonight and tomorrow and will report back.
> >>>>>>>
> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Yes I would like to try this, if you can point me to the pom.xml
> >>>>>>>>patch that would save me some time.
> >>>>>>>>
> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
> >>>>>>>>amount of bytes copied around in RAM during scanning, especially
> >>>>>>>>if you have wide rows and/or large key portions. That in turn
> >>>>>>>>makes scans scale better across cores, since RAM is a shared
> >>>>>>>>resource between cores (much like disk).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
> >>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
> >>>>>>>>do that.
> >>>>>>>>
> >>>>>>>> -- Lars
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>>  From: Bryan Keller <[email protected]>
> >>>>>>>> To: [email protected]
> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The table has hashed keys so rows are evenly distributed amongst
> >>>>>>>>the regionservers, and load on each regionserver is pretty much
> >>>>>>>>the same. I also have per-table balancing turned on. I get mostly
> >>>>>>>>data local mappers with only a few rack local (maybe 10 of the 250
> >>>>>>>>mappers).
> >>>>>>>>
> >>>>>>>> Currently the table is a wide table schema, with lists of data
> >>>>>>>>structures stored as columns with column prefixes grouping the
> >>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
> >>>>>>>>2_address, 2_city). I was thinking of moving those data structures
> >>>>>>>>to protobuf which would cut down on the number of columns. The
> >>>>>>>>downside is I can't filter on one value with that, but it is a
> >>>>>>>>tradeoff I would make for performance. I was also considering
> >>>>>>>>restructuring the table into a tall table.
> >>>>>>>>
> >>>>>>>> Something interesting is that my old regionserver machines had
> >>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was about
> >>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
> >>>>>>>>neither network nor disk I/O appear to be the bottleneck. The CPU
> >>>>>>>>is rather high for the regionserver so it seems like the best
> >>>>>>>>candidate to investigate. I will try profiling it tomorrow and
> >>>>>>>>will report back. I may revisit compression on vs off since that
> >>>>>>>>is adding load to the CPU.
> >>>>>>>>
> >>>>>>>> I'll also come up with a sample program that generates data
> >>>>>>>>similar to my table.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <[email protected]>
> >>>>>>>>wrote:
> >>>>>>>>
> >>>>>>>>> Your average row is 35k so scanner caching would not make a huge
> >>>>>>>>>difference, although I would have expected some improvements by
> >>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
> >>>>>>>>>
> >>>>>>>>> I assume your table is split sufficiently to touch all
> >>>>>>>>>RegionServers... Do you see the same load/IO on all region servers?
> >>>>>>>>>
> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
> >>>>>>>>> I blogged about some of these changes here:
> >>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >>>>>>>>>
> >>>>>>>>> In your case - since you have many columns, each of which carry
> >>>>>>>>>the rowkey - you might benefit a lot from HBASE-7279.
> >>>>>>>>>
> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
> >>>>>>>>>How could it not be?
> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
> >>>>>>>>>disabled in both HBase and HDFS.
> >>>>>>>>>
> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> >>>>>>>>>Purtell is listening, I think he did some tests with HBase on
> >>>>>>>>>SSDs.
> >>>>>>>>> With rotating media you typically see an improvement with
> >>>>>>>>>compression. With SSDs the added CPU needed for decompression
> >>>>>>>>>might outweigh the benefits.
> >>>>>>>>>
> >>>>>>>>> At the risk of starting a larger discussion here, I would posit
> >>>>>>>>>that HBase's LSM based design, which trades random IO for
> >>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
> >>>>>>>>>
> >>>>>>>>> If you can, it would be nice to run a profiler against one of
> >>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
> >>>>>>>>>see where it is bottlenecked.
> >>>>>>>>> (And if you send me a sample program to generate some data - not
> >>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
> >>>>>>>>>next days as my day job permits, but I do not have any machines
> >>>>>>>>>with SSDs).
> >>>>>>>>>
> >>>>>>>>> -- Lars
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ________________________________
> >>>>>>>>> From: Bryan Keller <[email protected]>
> >>>>>>>>> To: [email protected]
> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Yes, I have tried various settings for setCaching() and I have
> >>>>>>>>>setCacheBlocks(false)
> >>>>>>>>>
> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >>>>>>>>>>
> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>>>>>>>>>
> >>>>>>>>>> I guess you have used the above setting.
> >>>>>>>>>>
> >>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
> >>>>>>>>>> to, say, 0.94.7, which was recently released?
> >>>>>>>>>>
> >>>>>>>>>> Cheers
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
> >>>>>>>
> >>
> >
>
>
