Yes, I'm waiting on a response from them. It's just that the ping difference is tiny while the scan difference is huge: 2 sec vs 4 sec.
Note that the ping I mentioned is within the cluster. Pings from outside into the cluster show hardly any noticeable difference, if any at all.

On Sat, Dec 21, 2013 at 8:37 PM, Pradeep Gollakota <[email protected]> wrote:

> Do you know if machines 19-23 are on a different rack? It seems to me that
> your problem might be a networking problem. Whether it is hardware,
> configuration or something else entirely, I'm not sure. It might be
> worthwhile to talk to your systems administrator to see why pings to these
> machines are slow. What are the pings like from a bad RS to another bad RS?
>
>
> On Sat, Dec 21, 2013 at 7:17 PM, Kristoffer Sjögren <[email protected]> wrote:
>
> > Hi
> >
> > I have been performance tuning HBase 0.94.6 running Phoenix 2.2.0 the last
> > couple of days and need some help.
> >
> > Background.
> >
> > - 23 machine cluster, 32 cores, 4GB heap per RS.
> > - Table t_24 have 24 online regions (24 salt buckets).
> > - Table t_96 have 96 online regions (96 salt buckets).
> > - 10.5 million rows per table.
> > - Count query - select (*) from ...
> > - Group by query - select A, B, C sum(D) from ... where (A = 1 and T >= 0
> >   and T <= 2147482800) group by A, B, C;
> >
> > What I found ultimately is that region servers 19, 20, 21, 22 and 23 are
> > consistently 2-3x slower than the others. This hurts overall latency pretty
> > bad since queries are executed in parallel on the RS and then aggregated at
> > the client (through Phoenix). In Hannibal regions spread out evenly over
> > region servers, according to salt buckets (phoenix feature, pre-create
> > regions and a rowkey prefix).
> >
> > As far as I can tell, there is no network or hardware configuration
> > divergence between the machines. No CPU, network or other notable divergence
> > in Ganglia. No RS metric differences in HBase master console.
> >
> > The only thing that may be of interest is that pings (within the cluster) to
> > bad RS is about 2-3x slower, around 0.050ms vs 0.130ms. Not sure if this is
> > significant, but I get a bad feeling about it since it match exactly with the
> > RS that stood out in my performance tests.
> >
> > Any ideas of how I might find the source of this problem?
> >
> > Cheers,
> > -Kristoffer
> >
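
To put the numbers in this thread side by side (0.050 ms vs 0.130 ms pings, 2 sec vs 4 sec scans), here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the extra scan time came only from strictly sequential network round trips; that is an assumption, not something measured on the cluster, and the function name round_trips_needed is made up for this example.

```python
def round_trips_needed(extra_seconds, fast_rtt_ms, slow_rtt_ms):
    """Return how many strictly sequential round trips it would take for the
    RTT gap alone to add `extra_seconds` of latency.

    Hypothetical helper for illustration; it models nothing about HBase or
    Phoenix internals.
    """
    gap_ms = slow_rtt_ms - fast_rtt_ms
    if gap_ms <= 0:
        raise ValueError("slow_rtt_ms must be larger than fast_rtt_ms")
    return extra_seconds * 1000.0 / gap_ms


# Figures taken from the thread: pings of 0.050 ms vs 0.130 ms,
# scans of 2 sec vs 4 sec (so 2 extra seconds on the slow servers).
trips = round_trips_needed(extra_seconds=2.0, fast_rtt_ms=0.050, slow_rtt_ms=0.130)
ratio = 0.130 / 0.050

print(f"ping ratio: {ratio:.1f}x")
print(f"sequential round trips needed to explain 2 extra seconds: {trips:,.0f}")
```

Under that assumption it would take on the order of 25,000 back-to-back round trips per query for the ping gap by itself to account for the extra 2 seconds, so the slower pings may be a symptom of whatever is slowing those machines rather than the direct cause.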
