I found this thread on search-hadoop.com just now because I've been wrestling with the same issue for a while and have as yet been unable to solve it. However, I think I have an idea of the problem. My theory is based on assumptions about what's going on in HBase and HDFS internally, so please correct me if I'm wrong.
Briefly, I think the issue is that sequential reads from HDFS are pipelined, whereas sequential reads from HBase are not. Therefore, sequential reads from HDFS tend to keep the IO subsystem saturated, while sequential reads from HBase allow it to idle for a relatively large proportion of time.

To make this more concrete, suppose that I'm reading N bytes of data from a file in HDFS. I issue the calls to open the file and begin to read (from an InputStream, for example). As I'm reading byte 1 of the stream at my client, the datanode is reading byte M where 1 < M <= N from disk. Thus, three activities tend to happen concurrently for the most part (disregarding the beginning and end of the file): 1) processing at the client; 2) streaming over the network from datanode to client; and 3) reading data from disk at the datanode. The proportion of time these three activities overlap tends towards 100% as N -> infinity.

Now suppose I read a batch of R records from HBase (where R = whatever scanner caching happens to be). As I understand it, I issue my call to ResultScanner.next(), and this causes the RegionServer to block as if on a page fault while it loads enough HFile blocks from disk to cover the R records I (implicitly) requested. After the blocks are loaded into the block cache on the RS, the RS returns R records to me over the network. Then I process the R records locally. When they are exhausted, this cycle repeats.

The notable upshot is that while the RS is faulting HFile blocks into the cache, my client is blocked. Furthermore, while my client is processing records, the RS is idle with respect to work on behalf of my client. That last point is really the killer, if I'm correct in my assumptions. It means that Scanner caching and larger block sizes work only to amortize the fixed overhead of disk IOs and RPCs -- they do nothing to keep the IO subsystems saturated during sequential reads.
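To put rough numbers on the argument above, here is a back-of-the-envelope model in Java. It is only a sketch: the class and method names are my own (they are not HBase or HDFS APIs), the timings are made-up inputs, and it assumes every batch costs the same to fetch and to process, with at most one batch prefetched ahead.

```java
// Hypothetical timing model; nothing here is an HBase or HDFS API.
final class ScanPipelineModel {

    // Request/response cycle as described above: the client waits for each
    // batch, then the server sits idle while the client processes it.
    static long serialMillis(int batches, long fetchMillis, long processMillis) {
        return batches * (fetchMillis + processMillis);
    }

    // One batch prefetched ahead: the fetch of batch i+1 starts when the
    // client starts processing batch i, so each steady-state step costs
    // max(fetch, process). Only the first fetch and the last process are
    // not overlapped with anything.
    static long pipelinedMillis(int batches, long fetchMillis, long processMillis) {
        if (batches <= 0) {
            return 0;
        }
        long step = Math.max(fetchMillis, processMillis);
        return fetchMillis + (batches - 1) * step + processMillis;
    }

    public static void main(String[] args) {
        int batches = 1000; // assumed number of next() round trips
        long fetch = 40;    // assumed ms for the RS to load and ship one batch
        long process = 40;  // assumed ms for the client to consume one batch
        System.out.println("serial:    " + serialMillis(batches, fetch, process) + " ms");
        System.out.println("pipelined: " + pipelinedMillis(batches, fetch, process) + " ms");
    }
}
```

In this model, when fetch and process times are similar the serial total approaches twice the pipelined one. A larger caching value means fewer but proportionally more expensive batches, so it leaves that ratio roughly unchanged, which is the amortize-versus-overlap distinction I'm trying to draw.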
What *should* happen is that the RS should treat the Scanner caching value (R above) as a hint that it should always have ready records r + 1 to r + R when I'm reading record r, at least up to the region boundary. The RS should be preparing eagerly for the next call to ResultScanner.next(), which I suspect it's currently not doing.

Another way to state this would be to say that the client should tell the RS to prepare the next batch of records soon enough that they can start to arrive at the client just as the client is finishing the current batch. As is, I suspect it doesn't request more from the RS until the local batch is exhausted.

As I cautioned before, this is based on assumptions about how the internals work, so please correct me if I'm wrong.

Also, I'm way behind on the mailing list, so I probably won't see any responses unless CC'd directly.

Sandy

On 5/10/13 8:46 AM, "Bryan Keller" <[email protected]> wrote:

>FYI, I ran tests with compression on and off.
>
>With a plain HDFS sequence file and compression off, I am getting very
>good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>compression on with a sequence file, I/O speed is about 3x slower.
>However the file size is 3x smaller so it takes about the same time to
>scan.
>
>With HBase, the results are equivalent (just much slower than a sequence
>file). Scanning a compressed table is about 3x slower I/O than an
>uncompressed table, but the table is 3x smaller, so the time to scan is
>about the same. Scanning an HBase table takes about 3x as long as
>scanning the sequence file export of the table, either compressed or
>uncompressed. The sequence file export file size ends up being just
>barely larger than the table, either compressed or uncompressed.
>
>So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>the time to scan is about the same. Adding in HBase slows things down
>another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
>file vs scanning a compressed table.
>
>
>On May 8, 2013, at 10:15 AM, Bryan Keller <[email protected]> wrote:
>
>> Thanks for the offer Lars! I haven't made much progress speeding things
>>up.
>>
>> I finally put together a test program that populates a table that is
>>similar to my production dataset. I have a readme that should describe
>>things, hopefully enough to make it useable. There is code to populate a
>>test table, code to scan the table, and code to scan sequence files from
>>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
>>
>> You can find the code here:
>>
>> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>>
>>
>> On May 4, 2013, at 6:33 PM, lars hofhansl <[email protected]> wrote:
>>
>>> The blockbuffers are not reused, but that by itself should not be a
>>>problem as they are all the same size (at least I have never identified
>>>that as one in my profiling sessions).
>>>
>>> My offer still stands to do some profiling myself if there is an easy
>>>way to generate data of similar shape.
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>> From: Bryan Keller <[email protected]>
>>> To: [email protected]
>>> Sent: Friday, May 3, 2013 3:44 AM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>>
>>>
>>> Actually I'm not too confident in my results re block size, they may
>>>have been related to major compaction. I'm going to rerun before
>>>drawing any conclusions.
>>>
>>> On May 3, 2013, at 12:17 AM, Bryan Keller <[email protected]> wrote:
>>>
>>>> I finally made some progress. I tried a very large HBase block size
>>>>(16mb), and it significantly improved scan performance. I went from
>>>>45-50 min to 24 min. Not great but much better. Before I had it set to
>>>>128k. Scanning an equivalent sequence file takes 10 min. My random
>>>>read performance will probably suffer with such a large block size
>>>>(theoretically), so I probably can't keep it this big. I care about
>>>>random read performance too. I've read having a block size this big is
>>>>not recommended, is that correct?
>>>>
>>>> I haven't dug too deeply into the code, are the block buffers reused
>>>>or is each new block read a new allocation? Perhaps a buffer pool
>>>>could help here if there isn't one already. When doing a scan, HBase
>>>>could reuse previously allocated block buffers instead of allocating a
>>>>new one for each block. Then block size shouldn't affect scan
>>>>performance much.
>>>>
>>>> I'm not using a block encoder. Also, I'm still sifting through the
>>>>profiler results, I'll see if I can make more sense of it and run some
>>>>more experiments.
>>>>
>>>> On May 2, 2013, at 5:46 PM, lars hofhansl <[email protected]> wrote:
>>>>
>>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
>>>>>changed that much from 0.94.4)
>>>>>
>>>>>
>>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
>>>>>so, try without. They currently need to reallocate a ByteBuffer for
>>>>>each single KV.
>>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
>>>>>have not enabled encoding, just checking).
>>>>>
>>>>>
>>>>> And do you have a stack trace for the ByteBuffer.allocate(). That is
>>>>>a strange one since it never came up in my profiling (unless you
>>>>>enabled block encoding).
>>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
>>>>>have to drill in to find the allocate()).
>>>>>
>>>>>
>>>>> During normal scanning (again, without encoding) there should be no
>>>>>allocation happening except for blocks read from disk (and they
>>>>>should all be the same size, thus allocation should be cheap).
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Bryan Keller <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Thursday, May 2, 2013 10:54 AM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>
>>>>>
>>>>> I ran one of my regionservers through VisualVM. It looks like the
>>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
>>>>>ByteBuffer.allocate(). It appears at first glance that memory
>>>>>allocations may be an issue. Decompression was next below that but
>>>>>less of an issue it seems.
>>>>>
>>>>> Would changing the block size, either HDFS or HBase, help here?
>>>>>
>>>>> Also, if anyone has tips on how else to profile, that would be
>>>>>appreciated. VisualVM can produce a lot of noise that is hard to sift
>>>>>through.
>>>>>
>>>>>
>>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <[email protected]> wrote:
>>>>>
>>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>>>>
>>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <[email protected]> wrote:
>>>>>>
>>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
>>>>>>>0.94.7.
>>>>>>> I would be very curious to see profiling data.
>>>>>>>
>>>>>>> -- Lars
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: Bryan Keller <[email protected]>
>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>> Cc:
>>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>
>>>>>>> I tried running my test with 0.94.4, unfortunately performance was
>>>>>>>about the same. I'm planning on profiling the regionserver and
>>>>>>>trying some other things tonight and tomorrow and will report back.
>>>>>>>
>>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yes I would like to try this, if you can point me to the pom.xml
>>>>>>>>patch that would save me some time.
>>>>>>>>
>>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
>>>>>>>>amount of bytes copied around in RAM during scanning, especially
>>>>>>>>if you have wide rows and/or large key portions. That in turn
>>>>>>>>makes scans scale better across cores, since RAM is a shared
>>>>>>>>resource between cores (much like disk).
>>>>>>>>
>>>>>>>>
>>>>>>>> It's not hard to build the latest HBase against Cloudera's
>>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
>>>>>>>>do that.
>>>>>>>>
>>>>>>>> -- Lars
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: Bryan Keller <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>
>>>>>>>>
>>>>>>>> The table has hashed keys so rows are evenly distributed amongst
>>>>>>>>the regionservers, and load on each regionserver is pretty much
>>>>>>>>the same. I also have per-table balancing turned on. I get mostly
>>>>>>>>data local mappers with only a few rack local (maybe 10 of the 250
>>>>>>>>mappers).
>>>>>>>>
>>>>>>>> Currently the table is a wide table schema, with lists of data
>>>>>>>>structures stored as columns with column prefixes grouping the
>>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
>>>>>>>>2_address, 2_city). I was thinking of moving those data structures
>>>>>>>>to protobuf which would cut down on the number of columns. The
>>>>>>>>downside is I can't filter on one value with that, but it is a
>>>>>>>>tradeoff I would make for performance. I was also considering
>>>>>>>>restructuring the table into a tall table.
>>>>>>>>
>>>>>>>> Something interesting is that my old regionserver machines had
>>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was about
>>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
>>>>>>>>neither network nor disk I/O appear to be the bottleneck. The CPU
>>>>>>>>is rather high for the regionserver so it seems like the best
>>>>>>>>candidate to investigate. I will try profiling it tomorrow and
>>>>>>>>will report back. I may revisit compression on vs off since that
>>>>>>>>is adding load to the CPU.
>>>>>>>>
>>>>>>>> I'll also come up with a sample program that generates data
>>>>>>>>similar to my table.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <[email protected]>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>> Your average row is 35k so scanner caching would not make a huge
>>>>>>>>>difference, although I would have expected some improvements by
>>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>>>>>
>>>>>>>>> I assume your table is split sufficiently to touch all
>>>>>>>>>RegionServer... Do you see the same load/IO on all region servers?
>>>>>>>>>
>>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>>>> I blogged about some of these changes here:
>>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>>>>
>>>>>>>>> In your case - since you have many columns, each of which carry
>>>>>>>>>the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>>>>
>>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
>>>>>>>>>How could it not be?
>>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
>>>>>>>>>disabled in both HBase and HDFS.
>>>>>>>>>
>>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
>>>>>>>>>Purtell is listening, I think he did some tests with HBase on
>>>>>>>>>SSDs.
>>>>>>>>> With rotating media you typically see an improvement with
>>>>>>>>>compression. With SSDs the added CPU needed for decompression
>>>>>>>>>might outweigh the benefits.
>>>>>>>>>
>>>>>>>>> At the risk of starting a larger discussion here, I would posit
>>>>>>>>>that HBase's LSM based design, which trades random IO with
>>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
>>>>>>>>>
>>>>>>>>> If you can, it would be nice to run a profiler against one of
>>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
>>>>>>>>>see where it is bottlenecked.
>>>>>>>>> (And if you send me a sample program to generate some data - not
>>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
>>>>>>>>>next days as my day job permits, but I do not have any machines
>>>>>>>>>with SSDs).
>>>>>>>>>
>>>>>>>>> -- Lars
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Bryan Keller <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I have tried various settings for setCaching() and I have
>>>>>>>>>setCacheBlocks(false)
>>>>>>>>>
>>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>>>>
>>>>>>>>>> scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
>>>>>>>>>> scan.setCacheBlocks(false); // don't set to true for MR jobs
>>>>>>>>>>
>>>>>>>>>> I guess you have used the above setting.
>>>>>>>>>>
>>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
>>>>>>>>>>to, say
>>>>>>>>>> 0.94.7 which was recently released ?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>>>>
>>
>
