Sandy: Looking at patch v6 of HBASE-8420, I think it is different from your approach below for the case of cache.size() == 0.
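For concreteness, here is how I read the refill-below-threshold idea in your pseudo-code below, as a minimal Java sketch. It is not the HBASE-8420 patch and not actual ClientScanner code: PrefetchingScanner, fetchBatch (a stand-in for the RPC to the RS, returning an empty list when the scan is done) and lowWaterMark are made-up names, and it assumes a single consumer thread.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch only; not HBase's ClientScanner. Not thread-safe. */
final class PrefetchingScanner<T> {
    private final Queue<T> cache = new ArrayDeque<>();
    private final Supplier<List<T>> fetchBatch;  // assumed: empty list means "no more rows"
    private final int lowWaterMark;              // refill once cache.size() <= this
    private CompletableFuture<List<T>> pending;  // in-flight refill, or null
    private boolean exhausted;

    PrefetchingScanner(Supplier<List<T>> fetchBatch, int lowWaterMark) {
        this.fetchBatch = fetchBatch;
        this.lowWaterMark = lowWaterMark;
    }

    /** Returns the next item, or null once the source is exhausted. */
    T next() {
        // Absorb a refill that has already completed, without blocking.
        if (pending != null && pending.isDone()) {
            absorb();
        }
        // Cache is running low: start an asynchronous refill if none is in flight.
        if (!exhausted && pending == null && cache.size() <= lowWaterMark) {
            pending = CompletableFuture.supplyAsync(fetchBatch);
        }
        // Block only when there is nothing to return until the refill completes.
        if (cache.isEmpty() && pending != null) {
            absorb();
        }
        return cache.poll();
    }

    /** Moves the pending batch into the cache, waiting for it if necessary. */
    private void absorb() {
        List<T> batch = pending.join();
        pending = null;
        if (batch.isEmpty()) {
            exhausted = true;
        } else {
            cache.addAll(batch);
        }
    }
}
```

With lowWaterMark = 0 this degenerates to the current behaviour (fetch only when the cache is empty, then block); a larger value lets the fetch overlap with client-side processing of the rows still in the cache.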
Maybe log a JIRA for further discussion ?

On Wed, May 22, 2013 at 3:33 PM, Sandy Pratt <[email protected]> wrote:

> It seems to be in the ballpark of what I was getting at, but I haven't
> fully digested the code yet, so I can't say for sure.
>
> Here's what I'm getting at. Looking at
> o.a.h.h.client.ClientScanner.next() in the 94.2 source I have loaded, I
> see there are three branches with respect to the cache:
>
> public Result next() throws IOException {
>
>   // If the scanner is closed and there's nothing left in the cache,
>   // next is a no-op.
>   if (cache.size() == 0 && this.closed) {
>     return null;
>   }
>
>   if (cache.size() == 0) {
>     // Request more results from RS
>     ...
>   }
>
>   if (cache.size() > 0) {
>     return cache.poll();
>   }
>
>   ...
>   return null;
>
> }
>
>
> I think that middle branch wants to change as follows (pseudo-code):
>
> if the cache size is below a certain threshold then
>   initiate asynchronous action to refill it
>   if there is no result to return until the cache refill completes then
>     block
>   done
> done
>
> Or something along those lines. I haven't grokked the patch well enough
> yet to tell if that's what it does. What I think is happening in the
> 0.94.2 code I've got is that it requests nothing until the cache is empty,
> then blocks until it's non-empty. We want to eagerly and asynchronously
> refill the cache so that we ideally never have to block.
>
>
> Sandy
>
>
> On 5/22/13 1:39 PM, "Ted Yu" <[email protected]> wrote:
>
> >Sandy:
> >Do you think the following JIRA would help with what you expect in this
> >regard ?
> >
> >HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
> >
> >Cheers
> >
> >On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <[email protected]> wrote:
> >
> >> I found this thread on search-hadoop.com just now because I've been
> >> wrestling with the same issue for a while and have as yet been unable to
> >> solve it. However, I think I have an idea of the problem.
> >> My theory is based on assumptions about what's going on in HBase and
> >> HDFS internally, so please correct me if I'm wrong.
> >>
> >> Briefly, I think the issue is that sequential reads from HDFS are
> >> pipelined, whereas sequential reads from HBase are not. Therefore,
> >> sequential reads from HDFS tend to keep the IO subsystem saturated,
> >> while sequential reads from HBase allow it to idle for a relatively
> >> large proportion of time.
> >>
> >> To make this more concrete, suppose that I'm reading N bytes of data
> >> from a file in HDFS. I issue the calls to open the file and begin to
> >> read (from an InputStream, for example). As I'm reading byte 1 of the
> >> stream at my client, the datanode is reading byte M where 1 < M <= N
> >> from disk. Thus, three activities tend to happen concurrently for the
> >> most part (disregarding the beginning and end of the file):
> >> 1) processing at the client; 2) streaming over the network from
> >> datanode to client; and 3) reading data from disk at the datanode. The
> >> proportion of time these three activities overlap tends towards 100%
> >> as N -> infinity.
> >>
> >> Now suppose I read a batch of R records from HBase (where R = whatever
> >> scanner caching happens to be). As I understand it, I issue my call to
> >> ResultScanner.next(), and this causes the RegionServer to block as if
> >> on a page fault while it loads enough HFile blocks from disk to cover
> >> the R records I (implicitly) requested. After the blocks are loaded
> >> into the block cache on the RS, the RS returns R records to me over the
> >> network. Then I process the R records locally. When they are exhausted,
> >> this cycle repeats. The notable upshot is that while the RS is faulting
> >> HFile blocks into the cache, my client is blocked. Furthermore, while
> >> my client is processing records, the RS is idle with respect to work on
> >> behalf of my client.
> >>
> >> That last point is really the killer, if I'm correct in my assumptions.
> >> It means that Scanner caching and larger block sizes work only to
> >> amortize the fixed overhead of disk IOs and RPCs -- they do nothing to
> >> keep the IO subsystems saturated during sequential reads. What *should*
> >> happen is that the RS should treat the Scanner caching value (R above)
> >> as a hint that it should always have ready records r + 1 to r + R when
> >> I'm reading record r, at least up to the region boundary. The RS should
> >> be preparing eagerly for the next call to ResultScanner.next(), which I
> >> suspect it's currently not doing.
> >>
> >> Another way to state this would be to say that the client should tell
> >> the RS to prepare the next batch of records soon enough that they can
> >> start to arrive at the client just as the client is finishing the
> >> current batch. As is, I suspect it doesn't request more from the RS
> >> until the local batch is exhausted.
> >>
> >> As I cautioned before, this is based on assumptions about how the
> >> internals work, so please correct me if I'm wrong. Also, I'm way behind
> >> on the mailing list, so I probably won't see any responses unless CC'd
> >> directly.
> >>
> >> Sandy
> >>
> >> On 5/10/13 8:46 AM, "Bryan Keller" <[email protected]> wrote:
> >>
> >> >FYI, I ran tests with compression on and off.
> >> >
> >> >With a plain HDFS sequence file and compression off, I am getting very
> >> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> >> >compression on with a sequence file, I/O speed is about 3x slower.
> >> >However the file size is 3x smaller so it takes about the same time to
> >> >scan.
> >> >
> >> >With HBase, the results are equivalent (just much slower than a
> >> >sequence file). Scanning a compressed table is about 3x slower I/O than
> >> >an uncompressed table, but the table is 3x smaller, so the time to scan
> >> >is about the same.
> >> >Scanning an HBase table takes about 3x as long as scanning the sequence
> >> >file export of the table, either compressed or uncompressed. The
> >> >sequence file export file size ends up being just barely larger than
> >> >the table, either compressed or uncompressed
> >> >
> >> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> >> >the time to scan is about the same. Adding in HBase slows things down
> >> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed
> >> >sequence file vs scanning a compressed table.
> >> >
> >> >
> >> >On May 8, 2013, at 10:15 AM, Bryan Keller <[email protected]> wrote:
> >> >
> >> >> Thanks for the offer Lars! I haven't made much progress speeding
> >> >> things up.
> >> >>
> >> >> I finally put together a test program that populates a table that is
> >> >> similar to my production dataset. I have a readme that should
> >> >> describe things, hopefully enough to make it useable. There is code
> >> >> to populate a test table, code to scan the table, and code to scan
> >> >> sequence files from an export (to compare HBase w/ raw HDFS). I use a
> >> >> gradle build script.
> >> >>
> >> >> You can find the code here:
> >> >>
> >> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >> >>
> >> >>
> >> >> On May 4, 2013, at 6:33 PM, lars hofhansl <[email protected]> wrote:
> >> >>
> >> >>> The blockbuffers are not reused, but that by itself should not be a
> >> >>> problem as they are all the same size (at least I have never
> >> >>> identified that as one in my profiling sessions).
> >> >>>
> >> >>> My offer still stands to do some profiling myself if there is an
> >> >>> easy way to generate data of similar shape.
> >> >>>
> >> >>> -- Lars
> >> >>>
> >> >>>
> >> >>>
> >> >>> ________________________________
> >> >>> From: Bryan Keller <[email protected]>
> >> >>> To: [email protected]
> >> >>> Sent: Friday, May 3, 2013 3:44 AM
> >> >>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>
> >> >>>
> >> >>> Actually I'm not too confident in my results re block size, they may
> >> >>> have been related to major compaction. I'm going to rerun before
> >> >>> drawing any conclusions.
> >> >>>
> >> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <[email protected]> wrote:
> >> >>>
> >> >>>> I finally made some progress. I tried a very large HBase block size
> >> >>>> (16mb), and it significantly improved scan performance. I went from
> >> >>>> 45-50 min to 24 min. Not great but much better. Before I had it set
> >> >>>> to 128k. Scanning an equivalent sequence file takes 10 min. My
> >> >>>> random read performance will probably suffer with such a large
> >> >>>> block size (theoretically), so I probably can't keep it this big. I
> >> >>>> care about random read performance too. I've read having a block
> >> >>>> size this big is not recommended, is that correct?
> >> >>>>
> >> >>>> I haven't dug too deeply into the code, are the block buffers
> >> >>>> reused or is each new block read a new allocation? Perhaps a buffer
> >> >>>> pool could help here if there isn't one already. When doing a scan,
> >> >>>> HBase could reuse previously allocated block buffers instead of
> >> >>>> allocating a new one for each block. Then block size shouldn't
> >> >>>> affect scan performance much.
> >> >>>>
> >> >>>> I'm not using a block encoder. Also, I'm still sifting through the
> >> >>>> profiler results, I'll see if I can make more sense of it and run
> >> >>>> some more experiments.
> >> >>>>
> >> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <[email protected]> wrote:
> >> >>>>
> >> >>>>> Interesting.
> >> >>>>> If you can try 0.94.7 (but it'll probably not have changed that
> >> >>>>> much from 0.94.4)
> >> >>>>>
> >> >>>>>
> >> >>>>> Have you enabled one of the block encoders (FAST_DIFF, etc)? If
> >> >>>>> so, try without. They currently need to reallocate a ByteBuffer
> >> >>>>> for each single KV.
> >> >>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
> >> >>>>> have not enabled encoding, just checking).
> >> >>>>>
> >> >>>>>
> >> >>>>> And do you have a stack trace for the ByteBuffer.allocate()? That
> >> >>>>> is a strange one since it never came up in my profiling (unless
> >> >>>>> you enabled block encoding).
> >> >>>>> (You can get traces from VisualVM by creating a snapshot, but
> >> >>>>> you'd have to drill in to find the allocate()).
> >> >>>>>
> >> >>>>>
> >> >>>>> During normal scanning (again, without encoding) there should be
> >> >>>>> no allocation happening except for blocks read from disk (and they
> >> >>>>> should all be the same size, thus allocation should be cheap).
> >> >>>>>
> >> >>>>> -- Lars
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> ________________________________
> >> >>>>> From: Bryan Keller <[email protected]>
> >> >>>>> To: [email protected]
> >> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
> >> >>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>
> >> >>>>>
> >> >>>>> I ran one of my regionservers through VisualVM. It looks like the
> >> >>>>> top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
> >> >>>>> ByteBuffer.allocate(). It appears at first glance that memory
> >> >>>>> allocations may be an issue. Decompression was next below that but
> >> >>>>> less of an issue it seems.
> >> >>>>>
> >> >>>>> Would changing the block size, either HDFS or HBase, help here?
> >> >>>>>
> >> >>>>> Also, if anyone has tips on how else to profile, that would be
> >> >>>>> appreciated. VisualVM can produce a lot of noise that is hard to
> >> >>>>> sift through.
> >> >>>>>
> >> >>>>>
> >> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <[email protected]> wrote:
> >> >>>>>
> >> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
> >> >>>>>>
> >> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the
> >> >>>>>>> latest 0.94.7?
> >> >>>>>>> I would be very curious to see profiling data.
> >> >>>>>>>
> >> >>>>>>> -- Lars
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> ----- Original Message -----
> >> >>>>>>> From: Bryan Keller <[email protected]>
> >> >>>>>>> To: "[email protected]" <[email protected]>
> >> >>>>>>> Cc:
> >> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
> >> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>
> >> >>>>>>> I tried running my test with 0.94.4, unfortunately performance
> >> >>>>>>> was about the same. I'm planning on profiling the regionserver
> >> >>>>>>> and trying some other things tonight and tomorrow and will
> >> >>>>>>> report back.
> >> >>>>>>>
> >> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <[email protected]> wrote:
> >> >>>>>>>
> >> >>>>>>>> Yes I would like to try this, if you can point me to the
> >> >>>>>>>> pom.xml patch that would save me some time.
> >> >>>>>>>>
> >> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
> >> >>>>>>>> amount of bytes copied around in RAM during scanning,
> >> >>>>>>>> especially if you have wide rows and/or large key portions.
> >> >>>>>>>> That in turn makes scans scale better across cores, since RAM
> >> >>>>>>>> is a shared resource between cores (much like disk).
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
> >> >>>>>>>> version of Hadoop. I can send along a simple patch to pom.xml
> >> >>>>>>>> to do that.
> >> >>>>>>>>
> >> >>>>>>>> -- Lars
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> ________________________________
> >> >>>>>>>> From: Bryan Keller <[email protected]>
> >> >>>>>>>> To: [email protected]
> >> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
> >> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> The table has hashed keys so rows are evenly distributed
> >> >>>>>>>> amongst the regionservers, and load on each regionserver is
> >> >>>>>>>> pretty much the same. I also have per-table balancing turned
> >> >>>>>>>> on. I get mostly data local mappers with only a few rack local
> >> >>>>>>>> (maybe 10 of the 250 mappers).
> >> >>>>>>>>
> >> >>>>>>>> Currently the table is a wide table schema, with lists of data
> >> >>>>>>>> structures stored as columns with column prefixes grouping the
> >> >>>>>>>> data structures (e.g. 1_name, 1_address, 1_city, 2_name,
> >> >>>>>>>> 2_address, 2_city). I was thinking of moving those data
> >> >>>>>>>> structures to protobuf which would cut down on the number of
> >> >>>>>>>> columns. The downside is I can't filter on one value with
> >> >>>>>>>> that, but it is a tradeoff I would make for performance. I was
> >> >>>>>>>> also considering restructuring the table into a tall table.
> >> >>>>>>>>
> >> >>>>>>>> Something interesting is that my old regionserver machines had
> >> >>>>>>>> five 15k SCSI drives instead of 2 SSDs, and performance was
> >> >>>>>>>> about the same. Also, my old network was 1gbit, now it is
> >> >>>>>>>> 10gbit. So neither network nor disk I/O appear to be the
> >> >>>>>>>> bottleneck. The CPU is rather high for the regionserver so it
> >> >>>>>>>> seems like the best candidate to investigate. I will try
> >> >>>>>>>> profiling it tomorrow and will report back. I may revisit
> >> >>>>>>>> compression on vs off since that is adding load to the CPU.
> >> >>>>>>>>
> >> >>>>>>>> I'll also come up with a sample program that generates data
> >> >>>>>>>> similar to my table.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <[email protected]> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Your average row is 35k so scanner caching would not make a
> >> >>>>>>>>> huge difference, although I would have expected some
> >> >>>>>>>>> improvements by setting it to 10 or 50 since you have a wide
> >> >>>>>>>>> 10ge pipe.
> >> >>>>>>>>>
> >> >>>>>>>>> I assume your table is split sufficiently to touch all
> >> >>>>>>>>> RegionServers... Do you see the same load/IO on all region
> >> >>>>>>>>> servers?
> >> >>>>>>>>>
> >> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
> >> >>>>>>>>> I blogged about some of these changes here:
> >> >>>>>>>>> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >> >>>>>>>>>
> >> >>>>>>>>> In your case - since you have many columns, each of which
> >> >>>>>>>>> carries the rowkey - you might benefit a lot from HBASE-7279.
> >> >>>>>>>>>
> >> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full
> >> >>>>>>>>> scans. How could it not be?
> >> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's
> >> >>>>>>>>> is disabled in both HBase and HDFS.
> >> >>>>>>>>>
> >> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe
> >> >>>>>>>>> Andy Purtell is listening, I think he did some tests with
> >> >>>>>>>>> HBase on SSDs.
> >> >>>>>>>>> With rotating media you typically see an improvement with
> >> >>>>>>>>> compression. With SSDs the added CPU needed for decompression
> >> >>>>>>>>> might outweigh the benefits.
> >> >>>>>>>>>
> >> >>>>>>>>> At the risk of starting a larger discussion here, I would
> >> >>>>>>>>> posit that HBase's LSM based design, which trades random IO
> >> >>>>>>>>> with sequential IO, might be a bit more questionable on SSDs.
> >> >>>>>>>>>
> >> >>>>>>>>> If you can, it would be nice to run a profiler against one of
> >> >>>>>>>>> the RegionServers (or maybe do it with the single RS setup)
> >> >>>>>>>>> and see where it is bottlenecked.
> >> >>>>>>>>> (And if you send me a sample program to generate some data -
> >> >>>>>>>>> not 700g, though :) - I'll try to do a bit of profiling
> >> >>>>>>>>> during the next days as my day job permits, but I do not have
> >> >>>>>>>>> any machines with SSDs).
> >> >>>>>>>>>
> >> >>>>>>>>> -- Lars
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> ________________________________
> >> >>>>>>>>> From: Bryan Keller <[email protected]>
> >> >>>>>>>>> To: [email protected]
> >> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Yes, I have tried various settings for setCaching() and I
> >> >>>>>>>>> have setCacheBlocks(false)
> >> >>>>>>>>>
> >> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <[email protected]> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >> >>>>>>>>>>
> >> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
> >> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >> >>>>>>>>>>
> >> >>>>>>>>>> I guess you have used the above setting.
> >> >>>>>>>>>>
> >> >>>>>>>>>> 0.94.x releases are compatible. Have you considered
> >> >>>>>>>>>> upgrading to, say 0.94.7 which was recently released ?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Cheers
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
