I'd say go to Avro over protobufs in terms of redesigning your schema. With respect to CPUs, you don't say what your system looks like. Intel vs AMD , Num physical cores, what else you're running on the machine (#Mappers/Reducer slots) etc ...
In terms of the schema... How are you accessing your data? You said that you want to filter on a column value... using avro to store the address record in lets say a JSON string... write a custom filter? And people laughed at me when I said that schema design was critical and often misunderstood. ;-) (Ok the truth was that they laughed at me because I thought I looked cool wearing a plaid suit.) HTH On May 1, 2013, at 1:02 AM, Bryan Keller <[email protected]> wrote: > The table has hashed keys so rows are evenly distributed amongst the > regionservers, and load on each regionserver is pretty much the same. I also > have per-table balancing turned on. I get mostly data local mappers with only > a few rack local (maybe 10 of the 250 mappers). > > Currently the table is a wide table schema, with lists of data structures > stored as columns with column prefixes grouping the data structures (e.g. > 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of > moving those data structures to protobuf which would cut down on the number > of columns. The downside is I can't filter on one value with that, but it is > a tradeoff I would make for performance. I was also considering restructuring > the table into a tall table. > > Something interesting is that my old regionserver machines had five 15k SCSI > drives instead of 2 SSDs, and performance was about the same. Also, my old > network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear > to be the bottleneck. The CPU is rather high for the regionserver so it seems > like the best candidate to investigate. I will try profiling it tomorrow and > will report back. I may revisit compression on vs off since that is adding > load to the CPU. > > I'll also come up with a sample program that generates data similar to my > table. > > > On Apr 30, 2013, at 10:01 PM, lars hofhansl <[email protected]> wrote: > >> Your average row is 35k so scanner caching would not make a huge difference, >> although I would have expected some improvements by setting it to 10 or 50 >> since you have a wide 10ge pipe. >> >> I assume your table is split sufficiently to touch all RegionServer... Do >> you see the same load/IO on all region servers? >> >> A bunch of scan improvements went into HBase since 0.94.2. >> I blogged about some of these changes here: >> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html >> >> In your case - since you have many columns, each of which carry the rowkey - >> you might benefit a lot from HBASE-7279. >> >> In the end HBase *is* slower than straight HDFS for full scans. How could it >> not be? >> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in >> both HBase and HDFS. >> >> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is >> listening, I think he did some tests with HBase on SSDs. >> With rotating media you typically see an improvement with compression. With >> SSDs the added CPU needed for decompression might outweigh the benefits. >> >> At the risk of starting a larger discussion here, I would posit that HBase's >> LSM based design, which trades random IO with sequential IO, might be a bit >> more questionable on SSDs. >> >> If you can, it would be nice to run a profiler against one of the >> RegionServers (or maybe do it with the single RS setup) and see where it is >> bottlenecked. >> (And if you send me a sample program to generate some data - not 700g, >> though :) - I'll try to do a bit of profiling during the next days as my day >> job permits, but I do not have any machines with SSDs). >> >> -- Lars >> >> >> >> >> ________________________________ >> From: Bryan Keller <[email protected]> >> To: [email protected] >> Sent: Tuesday, April 30, 2013 9:31 PM >> Subject: Re: Poor HBase map-reduce scan performance >> >> >> Yes, I have tried various settings for setCaching() and I have >> setCacheBlocks(false) >> >> On Apr 30, 2013, at 9:17 PM, Ted Yu <[email protected]> wrote: >> >>> From http://hbase.apache.org/book.html#mapreduce.example : >>> >>> scan.setCaching(500); // 1 is the default in Scan, which will >>> be bad for MapReduce jobs >>> scan.setCacheBlocks(false); // don't set to true for MR jobs >>> >>> I guess you have used the above setting. >>> >>> 0.94.x releases are compatible. Have you considered upgrading to, say >>> 0.94.7 which was recently released ? >>> >>> Cheers >>> >>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <[email protected]> wrote: >>> >>>> I have been attempting to speed up my HBase map-reduce scans for a while >>>> now. I have tried just about everything without much luck. I'm running out >>>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and >>>> Hadoop 2.0.0 (CDH4.2.1). >>>> >>>> The table I'm scanning: >>>> 20 mil rows >>>> Hundreds of columns/row >>>> Column keys can be 30-40 bytes >>>> Column values are generally not large, 1k would be on the large side >>>> 250 regions >>>> Snappy compression >>>> 8gb region size >>>> 512mb memstore flush >>>> 128k block size >>>> 700gb of data on HDFS >>>> >>>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores >>>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate >>>> machine acting as namenode, HMaster, and zookeeper (single instance). I >>>> have disk local reads turned on. >>>> >>>> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting >>>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec. >>>> >>>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not >>>> really that great compared to the theoretical I/O. However this is far >>>> better than I am seeing with HBase map-reduce scans of my table. >>>> >>>> I have a simple no-op map-only job (using TableInputFormat) that scans the >>>> table and does nothing with data. This takes 45 minutes. That's about >>>> 260mb/sec read speed. This is over 5x slower than straight HDFS. >>>> Basically, with HBase I'm seeing read performance of my 16 SSD cluster >>>> performing nearly 35% slower than a single SSD. >>>> >>>> Here are some things I have changed to no avail: >>>> Scan caching values >>>> HDFS block sizes >>>> HBase block sizes >>>> Region file sizes >>>> Memory settings >>>> GC settings >>>> Number of mappers/node >>>> Compressed vs not compressed >>>> >>>> One thing I notice is that the regionserver is using quite a bit of CPU >>>> during the map reduce job. When dumping the jstack of the process, it seems >>>> like it is usually in some type of memory allocation or decompression >>>> routine which didn't seem abnormal. >>>> >>>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is >>>> high but not maxed out. Disk I/O and network I/O are low, IO wait is low. >>>> I'm on the verge of just writing the dataset out to sequence files once a >>>> day for scan purposes. Is that what others are doing? > >
