Hi, everyone. Here's a strange one, at least to me. I'm doing some performance profiling, and as a rudimentary test I've been using YCSB to drive HBase (originally 0.98.3, recently updated to 0.98.6). The problem happens on a few different instance sizes, but this is probably the closest comparison...
On m3.2xlarge instances, everything works as expected. On c3.2xlarge instances, HBase barely responds at all during workloads that involve read activity, falling silent for ~62-second intervals, with the YCSB throughput output resembling:

 0 sec: 0 operations;
 2 sec: 918 operations; 459 current ops/sec; [UPDATE AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
 4 sec: 918 operations; 0 current ops/sec;
 6 sec: 918 operations; 0 current ops/sec;
<snip>
62 sec: 918 operations; 0 current ops/sec;
64 sec: 5302 operations; 2192 current ops/sec; [UPDATE AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
66 sec: 5302 operations; 0 current ops/sec;
68 sec: 5302 operations; 0 current ops/sec;
(And so on...)

While that happens there's almost no activity on either side: the CPUs and disks are idle, with no iowait at all. Not much jumps out at me when digging through the Hadoop and HBase logs, except that those 62-second intervals are often (but not always) associated with ClosedChannelExceptions in the regionserver logs. I believe that's just HBase finding that a TCP connection it wants to reply on had already been closed.

Surprisingly, as far as I've seen this happens every time on this or any of the larger c3-class instances, while the m3 instance sizes all seem to work fine. Both are built from the same custom AMI with HBase and everything else installed, and launched via a script, so the instance type should be the only difference between them.

Has anyone seen anything like this? Any pointers as to what I could look at to help diagnose this odd problem? Could there be something I'm overlooking in the logs?

Thanks!
-- Josh
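P.S. In case it's useful to anyone trying to reproduce this, here's a rough sketch of the tally I've been using to see whether the ClosedChannelExceptions actually line up with the silent windows. The log path and the log4j-style "YYYY-MM-DD HH:MM:SS,mmm" timestamp prefix are assumptions about my particular setup; adjust for yours.

```shell
# Sketch: count ClosedChannelException log lines per minute, so the busiest
# minutes can be compared against the ~62-second stalls in the YCSB output.
# Assumes log lines start with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp.
count_ccx_per_minute() {
  grep 'ClosedChannelException' "$1" |
    cut -c1-16 |              # keep the timestamp down to the minute
    sort | uniq -c | sort -rn # most-affected minutes first
}

# Example invocation; the path is an assumption about the install layout.
count_ccx_per_minute /var/log/hbase/hbase-regionserver.log
```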
