Hi Andrew,

Thanks for the tip -- I'll definitely keep the heap bumped up on
subsequent tests. I increased it to 8 GB here, but that didn't make any
difference for the older YCSB.
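For reference, here's roughly how I set it -- just the standard knob in
conf/hbase-env.sh (if I have it right, the value is in MB and ends up
as -Xmx, so 8192 gives an 8 GB heap):

  # conf/hbase-env.sh
  # Heap size for the HBase daemons, in MB (8192 -> -Xmx8192m)
  export HBASE_HEAPSIZE=8192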
Using your YCSB branch with the updated HBase client definitely makes a
difference, though: it shows consistent throughput for a little while.
After a bit of time, so far under about five minutes in the few runs
I've done, it hits a NullPointerException[1] ... but on the whole this
definitely seems to point at a problem in the older YCSB.

[1] https://gist.github.com/joshwilliams/0570a3095ad6417ca74f

Thanks for your help,

-- Josh

On Thu, 2014-09-18 at 15:02 -0700, Andrew Purtell wrote:
> A 1 GB heap is nowhere near enough if you're trying to test something
> real (or approximate it with YCSB). Try 4 or 8 GB, anything up to
> 31 GB, depending on the use case. At >= 32 GB you give up compressed
> OOPs and maybe run into GC issues.
>
> Also, I recently redid the HBase YCSB client in a modern way for
> >= 0.98. See https://github.com/apurtell/YCSB/tree/new_hbase_client .
> It performs in an IMHO more useful fashion than the previous one for
> what YCSB is intended to do, but it might need some tuning (I haven't
> tried it on a cluster of significant size). One difference you should
> see is that it won't back up for 30-60 seconds after a bunch of
> threads flush fat 12+ MB write buffers.
>
> On Thu, Sep 18, 2014 at 2:31 PM, Josh Williams <[email protected]> wrote:
> > Ted,
> >
> > A stack trace, that's definitely a good idea. Here's one jstack
> > snapshot from the region server while there's no apparent activity
> > going on:
> > https://gist.github.com/joshwilliams/4950c1d92382ea7f3160
> >
> > If it's helpful, this is the YCSB side of the equation right around
> > the same time:
> > https://gist.github.com/joshwilliams/6fa3623088af9d1446a3
> >
> > And Gary,
> >
> > As far as the memory configuration goes, that's a good question. It
> > looks like HBASE_HEAPSIZE isn't set, which I now see defaults to
> > 1 GB. There isn't any swap configured, and 12 GB of the memory on
> > the instance is going to file cache, so there's definitely room to
> > spare.
> >
> > Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
> > Couldn't hurt to try that now...
> >
> > What's strange is that m3.xlarge, which also has 15 GB of RAM but
> > fewer CPU cores, runs fine.
> >
> > Thanks to you both for the insight!
> >
> > -- Josh
> >
> > On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
> >> What do you have HBASE_HEAPSIZE set to in hbase-env.sh? Is it
> >> possible that you're overcommitting memory and the instance is
> >> swapping? Just a shot in the dark, but I see that the m3.2xlarge
> >> instance has 30 GB of memory vs. 15 GB for c3.2xlarge.
> >>
> >> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <[email protected]> wrote:
> >> > bq. there's almost no activity on either side
> >> >
> >> > During this period, can you capture a stack trace for the region
> >> > server and pastebin it?
> >> >
> >> > Cheers
> >> >
> >> > On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams
> >> > <[email protected]> wrote:
> >> >
> >> >> Hi, everyone. Here's a strange one, at least to me.
> >> >>
> >> >> I'm doing some performance profiling, and as a rudimentary test
> >> >> I've been using YCSB to drive HBase (originally 0.98.3, recently
> >> >> updated to 0.98.6). The problem happens on a few different
> >> >> instance sizes, but this is probably the closest comparison...
> >> >>
> >> >> On m3.2xlarge instances, it works as expected.
> >> >> On c3.2xlarge instances, HBase barely responds at all during
> >> >> workloads that involve read activity, falling silent for
> >> >> ~62-second intervals, with the YCSB throughput output
> >> >> resembling:
> >> >>
> >> >> 0 sec: 0 operations;
> >> >> 2 sec: 918 operations; 459 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
> >> >> 4 sec: 918 operations; 0 current ops/sec;
> >> >> 6 sec: 918 operations; 0 current ops/sec;
> >> >> <snip>
> >> >> 62 sec: 918 operations; 0 current ops/sec;
> >> >> 64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
> >> >> 66 sec: 5302 operations; 0 current ops/sec;
> >> >> 68 sec: 5302 operations; 0 current ops/sec;
> >> >> (And so on...)
> >> >>
> >> >> While that happens there's almost no activity on either side;
> >> >> the CPUs and disks are idle, with no iowait at all.
> >> >>
> >> >> Not much jumps out at me when digging through the Hadoop and
> >> >> HBase logs, except that those 62-second intervals are often (but
> >> >> not always) associated with ClosedChannelExceptions in the
> >> >> region server logs. But I believe that's just HBase finding that
> >> >> a TCP connection it wants to reply on has already been closed.
> >> >>
> >> >> As far as I've seen, this happens every time on this or any of
> >> >> the larger c3-class instances, surprisingly. The m3 instance
> >> >> sizes all seem to work fine. The instances are built from a
> >> >> custom AMI with HBase and everything else installed, and are run
> >> >> via a script, so the instance type should be the only difference
> >> >> between them.
> >> >>
> >> >> Has anyone seen anything like this? Any pointers as to what I
> >> >> could look at to help diagnose this odd problem? Could there be
> >> >> something I'm overlooking in the logs?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> -- Josh

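P.S. In case it helps anyone else digging into this, the stalls are
easiest to catch in the act by snapshotting the region server's threads
while YCSB is reporting 0 ops/sec. Something along these lines, using
the standard JDK tools (the output file name is just illustrative):

  # Find the region server JVM's pid, then dump its threads
  jps | grep HRegionServer
  jstack -l <pid> > regionserver-stack.txt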