I don’t understand the logging output, but I do see a strange pattern. I’ll try to summarize.
There are 5 RegionServers, call them rs1 through rs5. There are a total of 174 regions for the table in question, with 69 in rs1. In the log output I see lines (greatly simplified) like the following:

AssignmentManager: Assigning fooTable, …. to rs2
AssignmentManager: Assigning fooTable, …. to rs3
AssignmentManager: Assigning fooTable, …. to rs4
AssignmentManager: Assigning fooTable, …. to rs5

There are 106 such lines, none logging an assignment to rs1. I also see 105 lines like:

AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs2
AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs3
…

where src=rs1 in every case, and dest=rs1 never occurs. I don’t see any exceptions or log output that reports a problem.

On Jul 22, 2014, at 9:18 AM, Ted Yu <[email protected]> wrote:

> The load balancer in 0.98 considers many factors when making balancing
> decisions.
>
> Can you take a look at the master log and look for balancer related lines ?
> That would give you some clue.
>
> Cheers
>
> On Jul 22, 2014, at 5:03 AM, Brian Jeltema <[email protected]>
> wrote:
>
>> I ran the balancer from hbase shell, but don’t see any change. Is there a
>> way to balance a specific table?
>>
>>> bq. One RegionServer has 69 regions
>>>
>>> Can you run load balancer so that your regions are better balanced ?
>>>
>>> Cheers
>>>
>>>
>>> On Mon, Jul 21, 2014 at 6:56 AM, Brian Jeltema <
>>> [email protected]> wrote:
>>>
>>>> There are 174 regions, not well balanced. One RegionServer has 69 regions.
>>>> That RegionServer generates a
>>>> series of log entries (modified and shown below), one for each region, at
>>>> roughly 1 to 2 second intervals. The timeout period expires when
>>>> it reaches region 36.
>>>>
>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for
>>>> hfiles
>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references
>>>> for [hdfs://
>>>> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2]
>>>> hfiles
>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file
>>>> (1/1) : hdfs://
>>>> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
>>>> 2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush
>>>> Snapshotting region
>>>> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
>>>> completed.
>>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region
>>>> operation on
>>>> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.2014-07-21
>>>> 07:49:45,137 DEBUG
>>>> [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1]
>>>> snapshot.FlushSnapshotSubprocedure: Starting region operation on
>>>> hosts,\x00\x8A\x90\xD6\x08,1400
>>>> 659179080.a74402fcbd9a96a7c92b250721095729.2014-07-21 07:49:45,137 DEBUG
>>>> [member: ‘xxx.digitalenvoy.net,60020,1405943192177'
>>>> subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager:
>>>> Completed 1/174 local region snapshots.
>>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush
>>>> Snapshotting region
>>>> hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
>>>> started...
>>>> 2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for
>>>> snapshot.
>>>>
>>>> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <[email protected]>
>>>> wrote:
>>>>
>>>>> Can you also tell us more about your table? How many regions on how many
>>>>> region servers?
>>>>>
>>>>>
>>>>> 2014-07-21 8:23 GMT-04:00 Ted Yu <[email protected]>:
>>>>>
>>>>>> Normally such timeout is caused by one region server which is slow in
>>>>>> completing its part of the snapshot procedure.
>>>>>>
>>>>>> Have you looked at region server logs ?
>>>>>> Feel free to pastebin relevant portion.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing
>>>>>>> out after 60 seconds.
>>>>>>> I increased the value of hbase.snapshot.master.timeoutMillis and
>>>>>>> restarted HBase,
>>>>>>> but the timeout still happens after 60 seconds. Any suggestions?
>>>>>>>
>>>>>>> Brian
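
For anyone who wants to reproduce the tallies given at the top of this message, here is a minimal sketch in Python. It assumes the greatly simplified line shapes quoted there ("Assigning ... to <server>" and "src=<server> ... dest=<server>"). Real master log lines carry timestamps and full server and region names, so the patterns would likely need adjusting. The function name tally and both regular expressions are illustrative only and are not part of HBase.

```python
import re
from collections import Counter

# Assumed (simplified) line shapes; adjust for real master log lines.
ASSIGN_RE = re.compile(r"AssignmentManager: Assigning .* to (\S+)\s*$")
PLAN_RE = re.compile(
    r"AssignmentManager: Using pre-existing plan for .*src=(\S+).*dest=(\S+)"
)


def tally(lines):
    """Count assignments per destination and plans per (src, dest) pair."""
    assigned_to = Counter()  # destination server -> number of "Assigning" lines
    moves = Counter()        # (src, dest) -> number of "pre-existing plan" lines
    for line in lines:
        plan = PLAN_RE.search(line)
        if plan:
            moves[(plan.group(1), plan.group(2))] += 1
            continue
        assign = ASSIGN_RE.search(line)
        if assign:
            assigned_to[assign.group(1)] += 1
    return assigned_to, moves
```

Passing it the lines of the master log (for example from an open file object) returns the two counters, which makes it easy to check whether rs1 ever shows up as a destination.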
