What do you have HBASE_HEAPSIZE set to in hbase-env.sh? Is it possible that you're overcommitting memory and the instance is swapping? Just a shot in the dark, but I see that the m3.2xlarge instance has 30G of memory vs. 15G for c3.2xlarge.
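In case it helps, here is a rough sketch of the check I have in mind. It assumes a Linux host with /proc/meminfo and that HBASE_HEAPSIZE is a plain number of megabytes, as the stock 0.98 hbase-env.sh describes it. The function names are made up for illustration and are not part of HBase.

```python
import re


def parse_heapsize_mb(env_text):
    """Return the last uncommented HBASE_HEAPSIZE value (in MB), or None.

    Assumption: the value is a bare integer number of megabytes. Values
    with a unit suffix (e.g. "4G") are deliberately not matched.
    """
    pattern = r'\s*(?:export\s+)?HBASE_HEAPSIZE=["\']?(\d+)["\']?\s*(?:#.*)?$'
    value = None
    for line in env_text.splitlines():
        match = re.match(pattern, line)
        if match:
            value = int(match.group(1))
    return value


def parse_meminfo_kb(meminfo_text):
    """Parse the text of /proc/meminfo into a {field: kB} dictionary."""
    fields = {}
    for line in meminfo_text.splitlines():
        match = re.match(r'(\w+):\s+(\d+)', line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def memory_report(env_text, meminfo_text):
    """Compare the configured heap with physical RAM and swap currently in use."""
    heap_mb = parse_heapsize_mb(env_text)
    mem = parse_meminfo_kb(meminfo_text)
    total_mb = mem["MemTotal"] // 1024
    swap_used_mb = (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)) // 1024
    return {
        "heap_mb": heap_mb,
        "total_mb": total_mb,
        "swap_used_mb": swap_used_mb,
        # Crude signal only: other JVMs on the box also need memory.
        "heap_exceeds_ram": heap_mb is not None and heap_mb >= total_mb,
    }


def main(env_path):
    """Print the report for a given hbase-env.sh path (hypothetical helper)."""
    with open(env_path) as env_file:
        env_text = env_file.read()
    with open("/proc/meminfo") as meminfo_file:
        meminfo_text = meminfo_file.read()
    print(memory_report(env_text, meminfo_text))
```

Call main() with the path to your hbase-env.sh on both instance types and compare the output. A nonzero swap_used_mb on the c3 box, or a heap close to total_mb once you add the DataNode and any other JVMs on the host, would point towards swapping.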

On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <[email protected]> wrote:

> bq. there's almost no activity on either side
>
> During this period, can you capture stack trace for the region server and
> pastebin the stack?
>
> Cheers
>
> On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <[email protected]>
> wrote:
>
>> Hi, everyone. Here's a strange one, at least to me.
>>
>> I'm doing some performance profiling, and as a rudimentary test I've
>> been using YCSB to drive HBase (originally 0.98.3, recently updated to
>> 0.98.6.) The problem happens on a few different instance sizes, but
>> this is probably the closest comparison...
>>
>> On m3.2xlarge instances, works as expected.
>> On c3.2xlarge instances, HBase barely responds at all during workloads
>> that involve read activity, falling silent for ~62 second intervals,
>> with the YCSB throughput output resembling:
>>
>> 0 sec: 0 operations;
>> 2 sec: 918 operations; 459 current ops/sec; [UPDATE
>> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
>> 4 sec: 918 operations; 0 current ops/sec;
>> 6 sec: 918 operations; 0 current ops/sec;
>> <snip>
>> 62 sec: 918 operations; 0 current ops/sec;
>> 64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
>> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
>> 66 sec: 5302 operations; 0 current ops/sec;
>> 68 sec: 5302 operations; 0 current ops/sec;
>> (And so on...)
>>
>> While that happens there's almost no activity on either side, the CPUs
>> and disks are idle, no iowait at all.
>>
>> There isn't much that jumps out at me when digging through the Hadoop
>> and HBase logs, except that those 62-second intervals are often (but
>> not always) associated with ClosedChannelExceptions in the regionserver
>> logs. But I believe that's just HBase finding that a TCP connection it
>> wants to reply on had been closed.
>>
>> As far as I've seen this happens every time on this or any of the larger
>> c3 class of instances, surprisingly. The m3 instance class sizes all
>> seem to work fine. These are built with a custom AMI that has HBase and
>> all installed, and run via a script, so the different instance type
>> should be the only difference between them.
>>
>> Anyone seen anything like this? Any pointers as to what I could look at
>> to help diagnose this odd problem? Could there be something I'm
>> overlooking in the logs?
>>
>> Thanks!
>>
>> -- Josh
