Hi,

We're running into a brick wall: our throughput numbers will not scale as we increase server counts, both with our custom in-house tests and with YCSB.
We're using HBase 0.92 on Hadoop 0.20.2 (we experienced the same issues on 0.90 before switching our testing to this version).

Our cluster consists of:
- Namenode and HMaster on separate servers: 24 cores, 64 GB
- Up to 11 datanode/regionservers: 24 cores, 64 GB, 4 * 1 TB disks (hope to get this changed)

We have adjusted our GC settings and MSLAB:

  <property>
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.hregion.memstore.mslab.chunksize</name>
    <value>2097152</value>
  </property>
  <property>
    <name>hbase.hregion.memstore.mslab.max.allocation</name>
    <value>1024768</value>
  </property>

HDFS xceivers is set to 8192. We've experimented with a variety of handler counts for the namenode, datanodes and regionservers, with no change in throughput.

For testing with YCSB, we do the following each time (with nothing else using the cluster):
- truncate the test table
- add a small amount of data, then split the table into 32 regions and call the balancer from the shell
- load 10 million rows
- do a 1:2:7 insert:update:read test with 10 million rows (64k/sec)
- do a 5:5 insert:update test with 10 million rows (23k/sec)
- do a pure read test with 10 million rows (75k/sec)

We have watched ganglia, iostat -d -x, iptraf, top, dstat and a variety of other diagnostic tools, and network/IO/CPU/memory bottlenecks seem highly unlikely, as none of them is ever seriously taxed. This leads me to assume this is some kind of locking issue? Delaying WAL flushes gives a small throughput bump, but it doesn't scale.

There also don't seem to be many figures around to compare ours to. We can get our throughput numbers higher with tricks like not writing the WAL, delaying flushes or batching requests, but nothing seems to scale with additional slaves.

Could anyone provide guidance as to what may be preventing throughput figures from scaling as we increase our slave count?
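
For reference, the xceiver limit mentioned above is a datanode-side setting. A minimal sketch of how it would typically appear in hdfs-site.xml, assuming the property name that Hadoop 0.20.x uses (the misspelling "xcievers" is part of the actual name); the file and exact layout here are an illustration, not a copy of our config:

```xml
<!-- hdfs-site.xml on each datanode (assumed location) -->
<property>
  <!-- Upper bound on concurrent DataXceiver threads per datanode -->
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>
```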