That did the trick. I set it to 100 and the regions are uniform now. Should I leave it there? What are the side effects of this change?
Thanks,
Brian

On Jul 22, 2014, at 11:28 AM, Ted Yu <[email protected]> wrote:

> Here is a code snippet from StochasticLoadBalancer w.r.t. TableSkewCostFunction:
>
>   private static final String TABLE_SKEW_COST_KEY =
>       "hbase.master.balancer.stochastic.tableSkewCost";
>   private static final float DEFAULT_TABLE_SKEW_COST = 35;
>
>   TableSkewCostFunction(Configuration conf) {
>     super(conf);
>     this.setMultiplier(conf.getFloat(TABLE_SKEW_COST_KEY,
>         DEFAULT_TABLE_SKEW_COST));
>   }
>
> You can try increasing the value for
> "hbase.master.balancer.stochastic.tableSkewCost".
>
> Cheers
>
> On Tue, Jul 22, 2014 at 6:59 AM, Brian Jeltema <[email protected]> wrote:
>
>> I don’t understand the logging output, but I do see a strange pattern.
>> I’ll try to summarize.
>>
>> There are 5 RegionServers, call them rs1 through rs5. There are a total
>> of 174 regions for the table in question, with 69 in rs1. In the log
>> output I see lines (greatly simplified) like the following:
>>
>>   AssignmentManager: Assigning fooTable, … to rs2
>>   AssignmentManager: Assigning fooTable, … to rs3
>>   AssignmentManager: Assigning fooTable, … to rs4
>>   AssignmentManager: Assigning fooTable, … to rs5
>>
>> There are 106 such lines, none logging an assignment to rs1.
>>
>> I also see 105 lines like:
>>
>>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs2
>>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs3
>>   …
>>
>> where src=rs1 in every case, and dest=rs1 never occurs.
>>
>> I don’t see any exceptions or log output that reports a problem.
>>
>> On Jul 22, 2014, at 9:18 AM, Ted Yu <[email protected]> wrote:
>>
>>> The load balancer in 0.98 considers many factors when making balancing
>>> decisions.
>>>
>>> Can you take a look at the master log and look for balancer-related
>>> lines? That would give you some clue.
>>>
>>> Cheers
>>>
>>> On Jul 22, 2014, at 5:03 AM, Brian Jeltema <[email protected]> wrote:
>>>
>>>> I ran the balancer from the hbase shell, but don’t see any change. Is
>>>> there a way to balance a specific table?
>>>>
>>>>> bq. One RegionServer has 69 regions
>>>>>
>>>>> Can you run the load balancer so that your regions are better balanced?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Mon, Jul 21, 2014 at 6:56 AM, Brian Jeltema <[email protected]> wrote:
>>>>>
>>>>>> There are 174 regions, not well balanced. One RegionServer has 69
>>>>>> regions. That RegionServer generates a series of log entries (modified
>>>>>> and shown below), one for each region, at roughly 1- to 2-second
>>>>>> intervals. The timeout period expires when it reaches region 36.
>>>>>>
>>>>>>   2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for hfiles
>>>>>>   2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references
>>>>>>   for [hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2] hfiles
>>>>>>   2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file
>>>>>>   (1/1) : hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
>>>>>>   2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush
>>>>>>   Snapshotting region hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6. completed.
>>>>>>   2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region
>>>>>>   operation on hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
>>>>>>   2014-07-21 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1]
>>>>>>   snapshot.FlushSnapshotSubprocedure: Starting region operation on
>>>>>>   hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
>>>>>>   2014-07-21 07:49:45,137 DEBUG [member: 'xxx.digitalenvoy.net,60020,1405943192177'
>>>>>>   subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager:
>>>>>>   Completed 1/174 local region snapshots.
>>>>>>   2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush
>>>>>>   Snapshotting region hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729. started...
>>>>>>   2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for snapshot.
>>>>>>
>>>>>> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>>>
>>>>>>> Can you also tell us more about your table? How many regions on how
>>>>>>> many region servers?
>>>>>>>
>>>>>>> 2014-07-21 8:23 GMT-04:00 Ted Yu <[email protected]>:
>>>>>>>
>>>>>>>> Normally such a timeout is caused by one region server that is slow
>>>>>>>> in completing its part of the snapshot procedure.
>>>>>>>>
>>>>>>>> Have you looked at the region server logs?
>>>>>>>> Feel free to pastebin the relevant portion.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s
>>>>>>>>> timing out after 60 seconds. I increased the value of
>>>>>>>>> hbase.snapshot.master.timeoutMillis and restarted HBase, but the
>>>>>>>>> timeout still happens after 60 seconds. Any suggestions?
>>>>>>>>>
>>>>>>>>> Brian
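[Note: both properties discussed in this thread go in hbase-site.xml on the HMaster host, and the HMaster must be restarted for them to take effect. A sketch, assuming a stock HBase 0.98 install; the values shown are the ones mentioned in the thread (100 for the skew cost) plus an illustrative 5-minute snapshot timeout, not tuned recommendations:]

```
<!-- hbase-site.xml on the HMaster; restart the HMaster after editing. -->
<property>
  <!-- Weight of the table-skew term in the StochasticLoadBalancer cost
       function; the compiled-in default is 35. -->
  <name>hbase.master.balancer.stochastic.tableSkewCost</name>
  <value>100</value>
</property>
<property>
  <!-- Snapshot timeout in milliseconds; 300000 here is an illustrative
       value, not a recommendation. -->
  <name>hbase.snapshot.master.timeoutMillis</name>
  <value>300000</value>
</property>
```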

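[Note: the multiplier lookup in Ted's snippet can be sketched as a small self-contained class. This is a hypothetical stand-in, not the HBase source — a plain Map replaces org.apache.hadoop.conf.Configuration — but the resolution rule is the same: read the key, fall back to the compiled-in default when it is unset.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of how TableSkewCostFunction resolves its multiplier.
public class TableSkewCostSketch {
    static final String TABLE_SKEW_COST_KEY =
            "hbase.master.balancer.stochastic.tableSkewCost";
    static final float DEFAULT_TABLE_SKEW_COST = 35f;

    // Stand-in for conf.getFloat(TABLE_SKEW_COST_KEY, DEFAULT_TABLE_SKEW_COST).
    static float getMultiplier(Map<String, String> conf) {
        String v = conf.get(TABLE_SKEW_COST_KEY);
        return v == null ? DEFAULT_TABLE_SKEW_COST : Float.parseFloat(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getMultiplier(conf));   // key unset: default 35.0
        conf.put(TABLE_SKEW_COST_KEY, "100");
        System.out.println(getMultiplier(conf));   // key set: 100.0
    }
}
```

Roughly speaking, because the balancer minimizes a weighted sum of several cost terms, raising one multiplier de-emphasizes the remaining terms in relative terms — which is the kind of side effect the question at the top of the thread is asking about.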