Sandy:
Do you think the following JIRA would help with what you expect in this regard?

HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb

Cheers

On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <[email protected]> wrote:

> I found this thread on search-hadoop.com just now because I've been
> wrestling with the same issue for a while and have as yet been unable to
> solve it. However, I think I have an idea of the problem. My theory is
> based on assumptions about what's going on in HBase and HDFS internally,
> so please correct me if I'm wrong.
>
> Briefly, I think the issue is that sequential reads from HDFS are
> pipelined, whereas sequential reads from HBase are not. Therefore,
> sequential reads from HDFS tend to keep the IO subsystem saturated, while
> sequential reads from HBase allow it to idle for a relatively large
> proportion of time.
>
> To make this more concrete, suppose that I'm reading N bytes of data from
> a file in HDFS. I issue the calls to open the file and begin to read
> (from an InputStream, for example). As I'm reading byte 1 of the stream
> at my client, the datanode is reading byte M where 1 < M <= N from disk.
> Thus, three activities tend to happen concurrently for the most part
> (disregarding the beginning and end of the file): 1) processing at the
> client; 2) streaming over the network from datanode to client; and 3)
> reading data from disk at the datanode. The proportion of time these
> three activities overlap tends towards 100% as N -> infinity.
>
> Now suppose I read a batch of R records from HBase (where R = whatever
> scanner caching happens to be). As I understand it, I issue my call to
> ResultScanner.next(), and this causes the RegionServer to block as if on a
> page fault while it loads enough HFile blocks from disk to cover the R
> records I (implicitly) requested. After the blocks are loaded into the
> block cache on the RS, the RS returns R records to me over the network.
> Then I process the R records locally. When they are exhausted, this cycle
> repeats.
> The notable upshot is that while the RS is faulting HFile blocks
> into the cache, my client is blocked. Furthermore, while my client is
> processing records, the RS is idle with respect to work on behalf of my
> client.
>
> That last point is really the killer, if I'm correct in my assumptions.
> It means that Scanner caching and larger block sizes work only to amortize
> the fixed overhead of disk IOs and RPCs -- they do nothing to keep the IO
> subsystems saturated during sequential reads. What *should* happen is
> that the RS should treat the Scanner caching value (R above) as a hint
> that it should always have ready records r + 1 to r + R when I'm reading
> record r, at least up to the region boundary. The RS should be preparing
> eagerly for the next call to ResultScanner.next(), which I suspect it's
> currently not doing.
>
> Another way to state this would be to say that the client should tell the
> RS to prepare the next batch of records soon enough that they can start to
> arrive at the client just as the client is finishing the current batch.
> As is, I suspect it doesn't request more from the RS until the local batch
> is exhausted.
>
> As I cautioned before, this is based on assumptions about how the
> internals work, so please correct me if I'm wrong. Also, I'm way behind
> on the mailing list, so I probably won't see any responses unless CC'd
> directly.
>
> Sandy
>
> On 5/10/13 8:46 AM, "Bryan Keller" <[email protected]> wrote:
>
> >FYI, I ran tests with compression on and off.
> >
> >With a plain HDFS sequence file and compression off, I am getting very
> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> >compression on with a sequence file, I/O speed is about 3x slower.
> >However the file size is 3x smaller so it takes about the same time to
> >scan.
> >
> >With HBase, the results are equivalent (just much slower than a sequence
> >file).
> >Scanning a compressed table is about 3x slower I/O than an
> >uncompressed table, but the table is 3x smaller, so the time to scan is
> >about the same. Scanning an HBase table takes about 3x as long as
> >scanning the sequence file export of the table, either compressed or
> >uncompressed. The sequence file export file size ends up being just
> >barely larger than the table, either compressed or uncompressed.
> >
> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> >the time to scan is about the same. Adding in HBase slows things down
> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
> >file vs scanning a compressed table.
> >
> >
> >On May 8, 2013, at 10:15 AM, Bryan Keller <[email protected]> wrote:
> >
> >> Thanks for the offer Lars! I haven't made much progress speeding things
> >>up.
> >>
> >> I finally put together a test program that populates a table that is
> >>similar to my production dataset. I have a readme that should describe
> >>things, hopefully enough to make it useable. There is code to populate a
> >>test table, code to scan the table, and code to scan sequence files from
> >>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >>
> >> You can find the code here:
> >>
> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >>
> >>
> >> On May 4, 2013, at 6:33 PM, lars hofhansl <[email protected]> wrote:
> >>
> >>> The blockbuffers are not reused, but that by itself should not be a
> >>>problem as they are all the same size (at least I have never identified
> >>>that as one in my profiling sessions).
> >>>
> >>> My offer still stands to do some profiling myself if there is an easy
> >>>way to generate data of similar shape.
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Bryan Keller <[email protected]>
> >>> To: [email protected]
> >>> Sent: Friday, May 3, 2013 3:44 AM
> >>> Subject: Re: Poor HBase map-reduce scan performance
> >>>
> >>>
> >>> Actually I'm not too confident in my results re block size, they may
> >>>have been related to major compaction. I'm going to rerun before
> >>>drawing any conclusions.
> >>>
> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <[email protected]> wrote:
> >>>
> >>>> I finally made some progress. I tried a very large HBase block size
> >>>>(16mb), and it significantly improved scan performance. I went from
> >>>>45-50 min to 24 min. Not great but much better. Before I had it set to
> >>>>128k. Scanning an equivalent sequence file takes 10 min. My random
> >>>>read performance will probably suffer with such a large block size
> >>>>(theoretically), so I probably can't keep it this big. I care about
> >>>>random read performance too. I've read having a block size this big is
> >>>>not recommended, is that correct?
> >>>>
> >>>> I haven't dug too deeply into the code, are the block buffers reused
> >>>>or is each new block read a new allocation? Perhaps a buffer pool
> >>>>could help here if there isn't one already. When doing a scan, HBase
> >>>>could reuse previously allocated block buffers instead of allocating a
> >>>>new one for each block. Then block size shouldn't affect scan
> >>>>performance much.
> >>>>
> >>>> I'm not using a block encoder. Also, I'm still sifting through the
> >>>>profiler results, I'll see if I can make more sense of it and run some
> >>>>more experiments.
> >>>>
> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <[email protected]> wrote:
> >>>>
> >>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
> >>>>>changed that much from 0.94.4)
> >>>>>
> >>>>>
> >>>>> Have you enabled one of the block encoders (FAST_DIFF, etc)? If
> >>>>>so, try without.
> >>>>>They currently need to reallocate a ByteBuffer for
> >>>>>each single KV.
> >>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
> >>>>>have not enabled encoding, just checking).
> >>>>>
> >>>>>
> >>>>> And do you have a stack trace for the ByteBuffer.allocate()? That is
> >>>>>a strange one since it never came up in my profiling (unless you
> >>>>>enabled block encoding).
> >>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
> >>>>>have to drill in to find the allocate()).
> >>>>>
> >>>>>
> >>>>> During normal scanning (again, without encoding) there should be no
> >>>>>allocation happening except for blocks read from disk (and they
> >>>>>should all be the same size, thus allocation should be cheap).
> >>>>>
> >>>>> -- Lars
> >>>>>
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>> From: Bryan Keller <[email protected]>
> >>>>> To: [email protected]
> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
> >>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>
> >>>>>
> >>>>> I ran one of my regionservers through VisualVM. It looks like the
> >>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
> >>>>>ByteBuffer.allocate(). It appears at first glance that memory
> >>>>>allocations may be an issue. Decompression was next below that but
> >>>>>less of an issue it seems.
> >>>>>
> >>>>> Would changing the block size, either HDFS or HBase, help here?
> >>>>>
> >>>>> Also, if anyone has tips on how else to profile, that would be
> >>>>>appreciated. VisualVM can produce a lot of noise that is hard to sift
> >>>>>through.
> >>>>>
> >>>>>
> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <[email protected]> wrote:
> >>>>>
> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
> >>>>>>
> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
> >>>>>>>0.94.7?
> >>>>>>> I would be very curious to see profiling data.
> >>>>>>>
> >>>>>>> -- Lars
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>> From: Bryan Keller <[email protected]>
> >>>>>>> To: "[email protected]" <[email protected]>
> >>>>>>> Cc:
> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>
> >>>>>>> I tried running my test with 0.94.4, unfortunately performance was
> >>>>>>>about the same. I'm planning on profiling the regionserver and
> >>>>>>>trying some other things tonight and tomorrow and will report back.
> >>>>>>>
> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Yes I would like to try this, if you can point me to the pom.xml
> >>>>>>>>patch that would save me some time.
> >>>>>>>>
> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
> >>>>>>>>amount of bytes copied around in RAM during scanning, especially
> >>>>>>>>if you have wide rows and/or large key portions. That in turn
> >>>>>>>>makes scans scale better across cores, since RAM is a shared
> >>>>>>>>resource between cores (much like disk).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
> >>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
> >>>>>>>>do that.
> >>>>>>>>
> >>>>>>>> -- Lars
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: Bryan Keller <[email protected]>
> >>>>>>>> To: [email protected]
> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The table has hashed keys so rows are evenly distributed amongst
> >>>>>>>>the regionservers, and load on each regionserver is pretty much
> >>>>>>>>the same. I also have per-table balancing turned on.
> >>>>>>>>I get mostly
> >>>>>>>>data local mappers with only a few rack local (maybe 10 of the 250
> >>>>>>>>mappers).
> >>>>>>>>
> >>>>>>>> Currently the table is a wide table schema, with lists of data
> >>>>>>>>structures stored as columns with column prefixes grouping the
> >>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
> >>>>>>>>2_address, 2_city). I was thinking of moving those data structures
> >>>>>>>>to protobuf which would cut down on the number of columns. The
> >>>>>>>>downside is I can't filter on one value with that, but it is a
> >>>>>>>>tradeoff I would make for performance. I was also considering
> >>>>>>>>restructuring the table into a tall table.
> >>>>>>>>
> >>>>>>>> Something interesting is that my old regionserver machines had
> >>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was about
> >>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
> >>>>>>>>neither network nor disk I/O appear to be the bottleneck. The CPU
> >>>>>>>>is rather high for the regionserver so it seems like the best
> >>>>>>>>candidate to investigate. I will try profiling it tomorrow and
> >>>>>>>>will report back. I may revisit compression on vs off since that
> >>>>>>>>is adding load to the CPU.
> >>>>>>>>
> >>>>>>>> I'll also come up with a sample program that generates data
> >>>>>>>>similar to my table.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Your average row is 35k so scanner caching would not make a huge
> >>>>>>>>>difference, although I would have expected some improvements by
> >>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
> >>>>>>>>>
> >>>>>>>>> I assume your table is split sufficiently to touch all
> >>>>>>>>>RegionServer... Do you see the same load/IO on all region servers?
> >>>>>>>>>
> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
> >>>>>>>>> I blogged about some of these changes here:
> >>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >>>>>>>>>
> >>>>>>>>> In your case - since you have many columns, each of which carries
> >>>>>>>>>the rowkey - you might benefit a lot from HBASE-7279.
> >>>>>>>>>
> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
> >>>>>>>>>How could it not be?
> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
> >>>>>>>>>disabled in both HBase and HDFS.
> >>>>>>>>>
> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> >>>>>>>>>Purtell is listening, I think he did some tests with HBase on
> >>>>>>>>>SSDs.
> >>>>>>>>> With rotating media you typically see an improvement with
> >>>>>>>>>compression. With SSDs the added CPU needed for decompression
> >>>>>>>>>might outweigh the benefits.
> >>>>>>>>>
> >>>>>>>>> At the risk of starting a larger discussion here, I would posit
> >>>>>>>>>that HBase's LSM based design, which trades random IO for
> >>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
> >>>>>>>>>
> >>>>>>>>> If you can, it would be nice to run a profiler against one of
> >>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
> >>>>>>>>>see where it is bottlenecked.
> >>>>>>>>> (And if you send me a sample program to generate some data - not
> >>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
> >>>>>>>>>next days as my day job permits, but I do not have any machines
> >>>>>>>>>with SSDs).
> >>>>>>>>>
> >>>>>>>>> -- Lars
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ________________________________
> >>>>>>>>> From: Bryan Keller <[email protected]>
> >>>>>>>>> To: [email protected]
> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Yes, I have tried various settings for setCaching() and I have
> >>>>>>>>>setCacheBlocks(false)
> >>>>>>>>>
> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >>>>>>>>>>
> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>>>>>>>>>
> >>>>>>>>>> I guess you have used the above setting.
> >>>>>>>>>>
> >>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
> >>>>>>>>>>to, say 0.94.7 which was recently released?
> >>>>>>>>>>
> >>>>>>>>>> Cheers
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
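For readers following Sandy's pipelining argument at the top of the thread, here is a minimal sketch of the idea in plain Java. It is not HBase code and does not claim to show what HBASE-8420 implements; PrefetchingBatchReader, PrefetchDemo and rangeSource are hypothetical names made up for illustration. It only shows the general pattern Sandy describes: while the caller processes batch r, the fetch of batch r + 1 is already in flight, so fetch time and processing time overlap instead of adding up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/**
 * Hypothetical illustration only, not an HBase class. Wraps any batch source and
 * keeps exactly one batch in flight while the caller works on the current one.
 */
class PrefetchingBatchReader<T> {
    private final Supplier<List<T>> source;        // must return an empty list when exhausted
    private CompletableFuture<List<T>> inFlight;   // the batch currently being fetched

    PrefetchingBatchReader(Supplier<List<T>> source) {
        this.source = source;
        // Start fetching the first batch before the caller asks for it.
        this.inFlight = CompletableFuture.supplyAsync(source);
    }

    /** Returns the next batch, or an empty list once the source is exhausted. */
    List<T> nextBatch() {
        List<T> batch = inFlight.join();           // waits only if the prefetch is not done yet
        if (batch.isEmpty()) {
            // Exhausted: remember the empty result and stop calling the source.
            inFlight = CompletableFuture.completedFuture(batch);
        } else {
            // Kick off the following fetch now, so it overlaps with the caller's processing.
            inFlight = CompletableFuture.supplyAsync(source);
        }
        return batch;
    }
}

class PrefetchDemo {
    /** Simulated slow batch source: hands out 0..total-1 in batches, sleeping before each one. */
    static Supplier<List<Integer>> rangeSource(int total, int batchSize, long delayMillis) {
        int[] next = {0};                          // only touched by one fetch at a time
        return () -> {
            try {
                Thread.sleep(delayMillis);         // stands in for disk and network latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            List<Integer> batch = new ArrayList<>();
            while (batch.size() < batchSize && next[0] < total) {
                batch.add(next[0]++);
            }
            return batch;
        };
    }

    public static void main(String[] args) {
        PrefetchingBatchReader<Integer> reader =
                new PrefetchingBatchReader<>(rangeSource(10, 4, 50));
        List<Integer> batch;
        while (!(batch = reader.nextBatch()).isEmpty()) {
            // While this line runs, the next batch is already being fetched.
            System.out.println("processing " + batch);
        }
    }
}
```

The sketch keeps only one batch ahead, which matches Sandy's "records r + 1 to r + R" hint. A deeper queue would smooth out uneven fetch times at the cost of more memory per scanner, and whether that trade is worthwhile on a RegionServer is exactly the kind of question the thread leaves open.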
