Juhani,

Can you look at the storefiles and tell how they behave during the test?
What is the size of the data you insert/update?

Mikael

On Mar 20, 2012 8:10 PM, "Juhani Connolly" <juha...@gmail.com> wrote:
> Hi Matt,
>
> this is something we haven't tested much; we were always running with
> about 32 regions, which gave enough coverage for an even spread over
> all machines.
> I will run our tests with enough regions per server to cover all cores
> and get back to the ml.
>
> On Tue, Mar 20, 2012 at 1:55 AM, Matt Corgan <mcor...@hotpads.com> wrote:
> > I'd be curious to see what happens if you split the table into 1 region per
> > CPU core, so 24 cores * 11 servers = 264 regions. Each region has 1
> > memstore, which is a ConcurrentSkipListMap, and you're currently hitting
> > each CSLM with 8 cores, which might be too contentious. Normally in
> > production you would want multiple memstores per CPU core.
> >
> > On Mon, Mar 19, 2012 at 5:31 AM, Juhani Connolly <juha...@gmail.com> wrote:
> >> Actually we did try running off two machines, both running our own
> >> tests in parallel. Unfortunately the result was a split that gives
> >> the same total throughput. We also did the same thing with iperf
> >> running from each machine to another machine, indicating 800Mb of
> >> additional throughput between each pair of machines.
> >> However, we didn't try these tests very thoroughly, so I will revisit
> >> them as soon as I get back to the office, thanks.
> >>
> >> On Mon, Mar 19, 2012 at 9:21 PM, Christian Schäfer <syrious3...@yahoo.de> wrote:
> >> > Referring to my experiences, I expect the client to be the bottleneck, too.
> >> >
> >> > So try to increase the count of client machines (not client threads),
> >> > each with its own unshared network interface.
> >> >
> >> > In my case I could double write throughput by doubling the client machine
> >> > count, with a much smaller system than yours (5 machines, 4 gigs RAM each).
> >> >
> >> > Good Luck
> >> > Chris
> >> >
> >> > ________________________________
> >> > From: Juhani Connolly <juha...@gmail.com>
> >> > To: user@hbase.apache.org
> >> > Sent: 13:02 Monday, 19 March 2012
> >> > Subject: Re: 0.92 and Read/writes not scaling
> >> >
> >> > I was concerned that may be the case too, which is why we ran the ycsb
> >> > tests in addition to our application-specific and general performance
> >> > tests. Checking profiles of the execution just showed the vast majority
> >> > of time spent waiting for responses. These were all run with 400
> >> > threads (though we tried more/less just in case).
> >> >
> >> > 2012/03/19 20:57 "Mingjian Deng" <koven2...@gmail.com>:
> >> >
> >> >> @Juhani:
> >> >> How many clients did you test with? Maybe the bottleneck was the client?
> >> >>
> >> >> 2012/3/19 Ramkrishna.S.Vasudevan <ramkrishna.vasude...@huawei.com>
> >> >>
> >> >> > Hi Juhani
> >> >> >
> >> >> > Can you tell us more about how the regions are balanced?
> >> >> > Are you overloading only a specific region server?
> >> >> >
> >> >> > Regards
> >> >> > Ram
> >> >> >
> >> >> > > -----Original Message-----
> >> >> > > From: Juhani Connolly [mailto:juha...@gmail.com]
> >> >> > > Sent: Monday, March 19, 2012 4:11 PM
> >> >> > > To: user@hbase.apache.org
> >> >> > > Subject: 0.92 and Read/writes not scaling
> >> >> > >
> >> >> > > Hi,
> >> >> > >
> >> >> > > We're running into a brick wall where our throughput numbers will not
> >> >> > > scale as we increase server counts, both using custom in-house tests
> >> >> > > and ycsb.
> >> >> > >
> >> >> > > We're using hbase 0.92 on hadoop 0.20.2 (we also experienced the same
> >> >> > > issues using 0.90 before switching our testing to this version).
> >> >> > >
> >> >> > > Our cluster consists of:
> >> >> > > - Namenode and hmaster on separate servers, 24 core, 64gb
> >> >> > > - up to 11 datanode/regionservers: 24 core, 64gb, 4 * 1tb disks
> >> >> > >   (hope to get this changed)
> >> >> > >
> >> >> > > We have adjusted our gc settings, and mslabs:
> >> >> > >
> >> >> > > <property>
> >> >> > >   <name>hbase.hregion.memstore.mslab.enabled</name>
> >> >> > >   <value>true</value>
> >> >> > > </property>
> >> >> > >
> >> >> > > <property>
> >> >> > >   <name>hbase.hregion.memstore.mslab.chunksize</name>
> >> >> > >   <value>2097152</value>
> >> >> > > </property>
> >> >> > >
> >> >> > > <property>
> >> >> > >   <name>hbase.hregion.memstore.mslab.max.allocation</name>
> >> >> > >   <value>1024768</value>
> >> >> > > </property>
> >> >> > >
> >> >> > > hdfs xceivers is set to 8192.
> >> >> > >
> >> >> > > We've experimented with a variety of handler counts for the namenode,
> >> >> > > datanodes and regionservers with no change in throughput.
> >> >> > >
> >> >> > > For testing with ycsb, we do the following each time (with nothing
> >> >> > > else using the cluster):
> >> >> > > - truncate the test table
> >> >> > > - add a small amount of data, then split the table into 32 regions
> >> >> > >   and call balancer from the shell
> >> >> > > - load 10m rows
> >> >> > > - do a 1:2:7 insert:update:read test with 10 million rows (64k/sec)
> >> >> > > - do a 5:5 insert:update test with 10 million rows (23k/sec)
> >> >> > > - do a pure read test with 10 million rows (75k/sec)
> >> >> > >
> >> >> > > We have observed ganglia, iostat -d -x, iptraf, top, dstat and a
> >> >> > > variety of other diagnostic tools, and network/io/cpu/memory
> >> >> > > bottlenecks seem highly unlikely, as none of them are ever seriously
> >> >> > > taxed. This leads me to assume this is some kind of locking issue?
> >> >> > > Delaying WAL flushes gives a small throughput bump, but it doesn't
> >> >> > > scale.
> >> >> > >
> >> >> > > There also don't seem to be many figures around to compare ours to.
> >> >> > > We can get our throughput numbers higher with tricks like not writing
> >> >> > > the WAL, delaying flushes, or batching requests, but nothing seems to
> >> >> > > scale with additional slaves.
> >> >> > > Could anyone provide guidance as to what may be preventing throughput
> >> >> > > figures from scaling as we increase our slave count?
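In case it helps as a starting point for Matt's suggestion above (24 cores * 11 servers = 264 regions), here is a minimal sketch of pre-splitting the table with the 0.92 client API. The table/family names are the YCSB defaults and the boundary keys are only a guess at the YCSB key space, so treat it as illustrative rather than a drop-in:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitUsertable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // "usertable"/"family" are the YCSB defaults; adjust to your schema.
    HTableDescriptor desc = new HTableDescriptor("usertable");
    desc.addFamily(new HColumnDescriptor("family"));

    // One region per core: 24 cores * 11 regionservers = 264 regions.
    int numRegions = 24 * 11;

    // Evenly divide the expected key range; YCSB keys look like "user<number>",
    // and same-length boundary keys keep the split arithmetic simple.
    byte[] startKey = Bytes.toBytes("user0000000000");
    byte[] endKey   = Bytes.toBytes("user9999999999");

    // Creates the table already split into numRegions regions, so writes are
    // spread across all regionservers from the start instead of after manual
    // splits and a balancer call.
    admin.createTable(desc, startKey, endKey, numRegions);
  }
}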
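For reference, the 1:2:7 insert:update:read phase above maps directly onto YCSB's CoreWorkload proportions. A sketch of such a workload file (record/operation counts from the numbers in the thread; everything else left at defaults, so double-check against your own setup):

# hypothetical workload file for the 1:2:7 insert:update:read phase
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=10000000
operationcount=10000000
insertproportion=0.1
updateproportion=0.2
readproportion=0.7
scanproportion=0

It would be passed to the YCSB client with something like -P <workloadfile> -p columnfamily=family -threads 400; the exact flags are from memory of the 2012-era client, so verify them against your copy.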
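And for completeness, a rough sketch of the two WAL "tricks" mentioned (deferred log flush and skipping the WAL per Put), again against the 0.92 client API and the YCSB default table name. Both trade durability for a throughput bump and neither would be expected to change how things scale:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalTweaks {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // 1. Deferred log flush: edits still reach the WAL, but it is synced on a
    //    timer (hbase.regionserver.optionallogflushinterval) instead of per write.
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.disableTable("usertable");
    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("usertable"));
    desc.setDeferredLogFlush(true);
    admin.modifyTable(Bytes.toBytes("usertable"), desc);
    admin.enableTable("usertable");

    // 2. Skipping the WAL entirely for a given Put: the edit only lives in the
    //    memstore until flush, so it is lost if the regionserver dies first.
    HTable table = new HTable(conf, "usertable");
    Put put = new Put(Bytes.toBytes("user0000000001"));
    put.add(Bytes.toBytes("family"), Bytes.toBytes("field0"), Bytes.toBytes("value"));
    put.setWriteToWAL(false);
    table.put(put);
    table.close();
  }
}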