Thanks for trying the new client out. Shame about that NPE -- I'll look
into it. A few config and diagnostic sketches follow below the quoted
thread, in case they're useful.
> On Sep 18, 2014, at 8:43 PM, Josh Williams <[email protected]> wrote:
>
> Hi Andrew,
>
> I'll definitely bump up the heap on subsequent tests -- thanks for the
> tip. It was increased to 8 GB, but that didn't make any difference for
> the older YCSB.
>
> Using your YCSB branch with the updated HBase client definitely makes a
> difference, however, showing consistent throughput for a little while.
> After a little bit of time, so far under about 5 minutes in the few
> times I ran it, it hits a NullPointerException [1] ... but that
> definitely seems to point more at a problem in the older YCSB.
>
> [1] https://gist.github.com/joshwilliams/0570a3095ad6417ca74f
>
> Thanks for your help,
>
> -- Josh
>
>
>> On Thu, 2014-09-18 at 15:02 -0700, Andrew Purtell wrote:
>> A 1 GB heap is nowhere near enough if you're trying to test something
>> real (or approximate it with YCSB). Try 4 or 8 GB, anything up to
>> 31 GB, depending on the use case. At >= 32 GB you give up compressed
>> OOPs and may run into GC issues.
>>
>> Also, I recently redid the HBase YCSB client in a modern way for >=
>> 0.98. See https://github.com/apurtell/YCSB/tree/new_hbase_client . It
>> performs, IMHO, in a more useful fashion than the previous one for
>> what YCSB is intended to do, but it might need some tuning (I haven't
>> tried it on a cluster of significant size). One difference you should
>> see is that it won't back up for 30-60 seconds after a bunch of
>> threads flush fat 12+ MB write buffers.
>>
>>> On Thu, Sep 18, 2014 at 2:31 PM, Josh Williams <[email protected]>
>>> wrote:
>>> Ted,
>>>
>>> A stack trace -- that's definitely a good idea. Here's one jstack
>>> snapshot from the region server while there's no apparent activity
>>> going on:
>>> https://gist.github.com/joshwilliams/4950c1d92382ea7f3160
>>>
>>> If it's helpful, this is the YCSB side of the equation right around
>>> the same time:
>>> https://gist.github.com/joshwilliams/6fa3623088af9d1446a3
>>>
>>>
>>> And Gary,
>>>
>>> As for the memory configuration, that's a good question. It looks
>>> like HBASE_HEAPSIZE isn't set, which I now see defaults to 1 GB.
>>> There isn't any swap configured, and 12 GB of the memory on the
>>> instance is going to the file cache, so there's definitely room to
>>> spare.
>>>
>>> Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
>>> Couldn't hurt to try that now...
>>>
>>> What's strange is that on m3.xlarge, which also has 15 GB of RAM but
>>> fewer CPU cores, it runs fine.
>>>
>>> Thanks to you both for the insight!
>>>
>>> -- Josh
>>>
>>>
>>>
>>>> On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
>>>> What do you have HBASE_HEAPSIZE set to in hbase-env.sh? Is it
>>>> possible that you're overcommitting memory and the instance is
>>>> swapping? Just a shot in the dark, but I see that the m3.2xlarge
>>>> instance has 30 GB of memory vs. 15 GB for c3.2xlarge.
>>>>
>>>>> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <[email protected]> wrote:
>>>>> bq. there's almost no activity on either side
>>>>>
>>>>> During this period, can you capture a stack trace for the region
>>>>> server and pastebin it?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi, everyone. Here's a strange one, at least to me.
>>>>>>
>>>>>> I'm doing some performance profiling, and as a rudimentary test
>>>>>> I've been using YCSB to drive HBase (originally 0.98.3, recently
>>>>>> updated to 0.98.6). The problem happens on a few different
>>>>>> instance sizes, but this is probably the closest comparison...
>>>>>>
>>>>>> On m3.2xlarge instances, it works as expected.
>>>>>> On c3.2xlarge instances, HBase barely responds at all during
>>>>>> workloads that involve read activity, falling silent for
>>>>>> ~62-second intervals, with the YCSB throughput output resembling:
>>>>>>
>>>>>> 0 sec: 0 operations;
>>>>>> 2 sec: 918 operations; 459 current ops/sec; [UPDATE
>>>>>> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
>>>>>> 4 sec: 918 operations; 0 current ops/sec;
>>>>>> 6 sec: 918 operations; 0 current ops/sec;
>>>>>> <snip>
>>>>>> 62 sec: 918 operations; 0 current ops/sec;
>>>>>> 64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
>>>>>> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
>>>>>> 66 sec: 5302 operations; 0 current ops/sec;
>>>>>> 68 sec: 5302 operations; 0 current ops/sec;
>>>>>> (And so on...)
>>>>>>
>>>>>> While that happens there's almost no activity on either side; the
>>>>>> CPUs and disks are idle, with no iowait at all.
>>>>>>
>>>>>> There isn't much that jumps out at me when digging through the
>>>>>> Hadoop and HBase logs, except that those 62-second intervals are
>>>>>> often (but not always) associated with ClosedChannelExceptions in
>>>>>> the region server logs. But I believe that's just HBase finding
>>>>>> that a TCP connection it wants to reply on has already been
>>>>>> closed.
>>>>>>
>>>>>> As far as I've seen, this happens every time on this or any of the
>>>>>> larger c3-class instances, surprisingly. The m3 instance sizes all
>>>>>> seem to work fine. These are built from a custom AMI that has
>>>>>> HBase and everything else installed, and run via a script, so the
>>>>>> instance type should be the only difference between them.
>>>>>>
>>>>>> Has anyone seen anything like this? Any pointers as to what I
>>>>>> could look at to help diagnose this odd problem? Could there be
>>>>>> something I'm overlooking in the logs?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -- Josh
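For reference, bumping the heap along the lines discussed above is just
a matter of editing conf/hbase-env.sh and restarting the region servers.
A minimal sketch, assuming an 8 GB target (the value Josh tried) and a
0.98-era install where HBASE_HEAPSIZE is taken as megabytes:

    # conf/hbase-env.sh -- example values only; anything <= 31 GB keeps
    # compressed OOPs
    export HBASE_HEAPSIZE=8000        # interpreted as MB on 0.98

    # Or pin the heap explicitly through the JVM options instead:
    # export HBASE_OPTS="$HBASE_OPTS -Xms8g -Xmx8g"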
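Likewise, if anyone else wants to grab the kind of region server stack
trace Ted asked for while the workload is stalled, something like the
following sketch works; the jps-based pid lookup is just one convenient
way to find the process, and it should run as the user that owns the
region server:

    # Dump the region server's stack during one of the silent intervals
    RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
    jstack "$RS_PID" > rs-$(date +%s).jstack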
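And for completeness, a typical YCSB invocation against HBase looks
roughly like the sketch below. The 'hbase' binding name, the
columnfamily property, and the usertable/family names are the stock
YCSB conventions, assumed here rather than anything specific to the
branch above, so adjust as needed:

    # Pre-create the target table from the HBase shell first, e.g.:
    #   create 'usertable', 'family'

    # Load the data set, then run a read/update workload with status output
    bin/ycsb load hbase -P workloads/workloada -p columnfamily=family -threads 16 -s
    bin/ycsb run hbase -P workloads/workloada -p columnfamily=family -threads 16 -s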
