Hi Juhani,

Can you tell us more about how the regions are balanced? Are you overloading only a specific region server?
Regards
Ram

> -----Original Message-----
> From: Juhani Connolly [mailto:juha...@gmail.com]
> Sent: Monday, March 19, 2012 4:11 PM
> To: user@hbase.apache.org
> Subject: 0.92 and Read/writes not scaling
>
> Hi,
>
> We're running into a brick wall where our throughput numbers will not
> scale as we increase server counts, both using custom in-house tests
> and ycsb.
>
> We're using hbase 0.92 on hadoop 0.20.2 (we also experienced the same
> issues using 0.90 before switching our testing to this version).
>
> Our cluster consists of:
> - Namenode and hmaster on separate servers, 24 core, 64gb
> - up to 11 datanode/regionservers: 24 core, 64gb, 4 * 1tb disks (hope
>   to get this changed)
>
> We have adjusted our gc settings and mslabs:
>
> <property>
>   <name>hbase.hregion.memstore.mslab.enabled</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>hbase.hregion.memstore.mslab.chunksize</name>
>   <value>2097152</value>
> </property>
>
> <property>
>   <name>hbase.hregion.memstore.mslab.max.allocation</name>
>   <value>1024768</value>
> </property>
>
> hdfs xceivers is set to 8192.
>
> We've experimented with a variety of handler counts for the namenode,
> datanodes and regionservers with no change in throughput.
>
> For testing with ycsb, we do the following each time (with nothing
> else using the cluster):
> - truncate the test table
> - add a small amount of data, then split the table into 32 regions
>   and call balancer from the shell
> - load 10m rows
> - do a 1:2:7 insert:update:read test with 10 million rows (64k/sec)
> - do a 5:5 insert:update test with 10 million rows (23k/sec)
> - do a pure read test with 10 million rows (75k/sec)
>
> We have observed ganglia, iostat -d -x, iptraf, top, dstat and a
> variety of other diagnostic tools, and network/io/cpu/memory
> bottlenecks seem highly unlikely as none of them are ever seriously
> taxed. This leads me to assume this is some kind of locking issue?
> Delaying WAL flushes gives a small throughput bump, but it doesn't
> scale.
>
> There also don't seem to be many figures around to compare ours to.
> We can get our throughput numbers higher with tricks like not writing
> the WAL, delaying flushes, or batching requests, but nothing seems to
> scale with additional slaves.
>
> Could anyone provide guidance as to what may be preventing throughput
> figures from scaling as we increase our slave count?
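As an aside on the pre-splitting step described above (split into 32 regions, then balance): an alternative is to create the table already pre-split, so the initial load never funnels into one region. A minimal sketch of computing evenly spaced split keys — the `evenSplits` helper is hypothetical, and it assumes fixed-width, zero-padded numeric rowkeys (YCSB's default `user…` key scheme would need its own boundaries). The resulting array is the shape expected by `HBaseAdmin.createTable(desc, splits)`:

```java
import java.util.Arrays;

public class SplitKeys {
    // Hypothetical helper: compute (numRegions - 1) evenly spaced split
    // boundaries over the rowkey space [0, maxKey), for passing to
    // HBaseAdmin.createTable(desc, splits) so the table starts out
    // pre-split instead of being split after an initial load.
    static byte[][] evenSplits(long maxKey, int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            long boundary = maxKey / numRegions * i;
            // Zero-pad so lexicographic byte order matches numeric order.
            splits[i - 1] = String.format("%010d", boundary).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = evenSplits(10_000_000L, 32);
        System.out.println(splits.length);          // 31 boundaries -> 32 regions
        System.out.println(new String(splits[0]));  // first boundary: 0000312500
    }
}
```

Whether the splits line up with the actual key distribution matters more than the count; with hashed or prefixed keys, boundaries derived from observed keys would be needed instead.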