Re: snapshot timeout problem

Brian Jeltema Tue, 22 Jul 2014 07:00:42 -0700

I don’t understand the logging output, but I do see a strange pattern. I’ll try 
to summarize.


There are 5 RegionServers, call them rs1 through rs5. The table in question has
a total of 174 regions, 69 of them on rs1. In the master log I see lines
(greatly simplified) like the following:

   AssignmentManager: Assigning fooTable, …. to rs2
   AssignmentManager: Assigning fooTable, …. to rs3
   AssignmentManager: Assigning fooTable, …. to rs4
   AssignmentManager: Assigning fooTable, …. to rs5

There are 106 such lines, and none of them logs an assignment to rs1.

I also see 105 lines like:

  AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs2
  AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs3
  …

where src=rs1 in every case, and dest=rs1 never occurs.

I don’t see any exceptions or log output that reports a problem.
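
For anyone who wants to repeat this tally, here is a minimal sketch in Python. It assumes log lines shaped like the simplified excerpts above; real master log lines also carry timestamps, region names and full server names, so the patterns would likely need adjusting. The sample lines and the function name `tally` are hypothetical, for illustration only.

```python
import re
from collections import Counter

# Hypothetical sample modeled on the simplified excerpts above; not real log output.
SAMPLE = """\
AssignmentManager: Assigning fooTable, ... to rs2
AssignmentManager: Using pre-existing plan for fooTable ... src=rs1 ... dest=rs2
AssignmentManager: Assigning fooTable, ... to rs3
AssignmentManager: Using pre-existing plan for fooTable ... src=rs1 ... dest=rs3
"""

# Assumed line shapes: "Assigning <table>, ... to <server>" and
# "Using pre-existing plan for ... src=<server> ... dest=<server>".
ASSIGN_RE = re.compile(r"AssignmentManager: Assigning \S+.* to (\S+)\s*$")
PLAN_RE = re.compile(r"Using pre-existing plan for .*?src=(\S+).*?dest=(\S+)")


def tally(lines):
    """Count assignments per destination and plans per (src, dest) pair."""
    assigned_to = Counter()
    plans = Counter()
    for line in lines:
        plan = PLAN_RE.search(line)
        if plan:
            plans[(plan.group(1), plan.group(2))] += 1
            continue
        assign = ASSIGN_RE.search(line)
        if assign:
            assigned_to[assign.group(1)] += 1
    return assigned_to, plans


if __name__ == "__main__":
    assigned_to, plans = tally(SAMPLE.splitlines())
    print("assignments per destination:", dict(assigned_to))
    print("plans per (src, dest):", dict(plans))
```

A server that never appears as a key in either counter (rs1 as a destination, in my case) is never being chosen as a target.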


On Jul 22, 2014, at 9:18 AM, Ted Yu <[email protected]> wrote:

> The load balancer in 0.98 considers many factors when making balancing 
> decisions. 
> 
> Can you take a look at the master log and look for balancer-related lines?
> That would give you some clues. 
> 
> Cheers
> 
> On Jul 22, 2014, at 5:03 AM, Brian Jeltema <[email protected]> 
> wrote:
> 
>> I ran the balancer from hbase shell, but don’t see any change. Is there a 
>> way to balance a specific table?
>> 
>>> bq. One RegionServer has 69 regions
>>> 
>>> Can you run the load balancer so that your regions are better balanced?
>>> 
>>> Cheers
>>> 
>>> 
>>> On Mon, Jul 21, 2014 at 6:56 AM, Brian Jeltema <
>>> [email protected]> wrote:
>>> 
>>>> There are 174 regions, not well balanced. One RegionServer has 69 regions.
>>>> That RegionServer generates a
>>>> series of log entries (modified and shown below), one for each region, at
>>>> roughly 1 to 2 second intervals. The timeout period expires when
>>>> it reaches region 36.
>>>> 
>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for
>>>> hfiles
>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references
>>>> for [hdfs://
>>>> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2]
>>>> hfiles
>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file
>>>> (1/1) : hdfs://
>>>> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
>>>> 2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush
>>>> Snapshotting region
>>>> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
>>>> completed.
>>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region
>>>> operation on
>>>> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
>>>> 2014-07-21 07:49:45,137 DEBUG
>>>> [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1]
>>>> snapshot.FlushSnapshotSubprocedure: Starting region operation on
>>>> hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
>>>> 2014-07-21 07:49:45,137 DEBUG
>>>> [member: 'xxx.digitalenvoy.net,60020,1405943192177'
>>>> subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager:
>>>> Completed 1/174 local region snapshots.
>>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush
>>>> Snapshotting region
>>>> hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
>>>> started...
>>>> 2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for
>>>> snapshot.
>>>> 
>>>> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <[email protected]>
>>>> wrote:
>>>> 
>>>>> Can you also tell us more about your table? How many regions on how many
>>>>> region servers?
>>>>> 
>>>>> 
>>>>> 2014-07-21 8:23 GMT-04:00 Ted Yu <[email protected]>:
>>>>> 
>>>>>> Normally such timeout is caused by one region server which is slow in
>>>>>> completing its part of the snapshot procedure.
>>>>>> 
>>>>>> Have you looked at region server logs ?
>>>>>> Feel free to pastebin relevant portion.
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing
>>>>>>> out after 60 seconds.
>>>>>>> I increased the value of hbase.snapshot.master.timeoutMillis and
>>>>>>> restarted HBase,
>>>>>>> but the timeout still happens after 60 seconds. Any suggestions?
>>>>>>> 
>>>>>>> Brian
>> 
>
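
The timing figures in the original report quoted above can be checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate only: it assumes the RegionServer snapshots its regions one after another at a roughly constant rate, which the report suggests but does not confirm. The function name is hypothetical.

```python
import math

# Figures taken from the thread.
TIMEOUT_S = 60            # observed timeout
REGIONS_AT_TIMEOUT = 36   # region being processed when the timeout expired
REGIONS_ON_SLOW_RS = 69   # regions hosted on the heavily loaded RegionServer
TOTAL_REGIONS = 174
REGION_SERVERS = 5


def estimate_seconds(regions, timeout_s=TIMEOUT_S, done=REGIONS_AT_TIMEOUT):
    """Estimate the time to snapshot `regions` regions sequentially.

    Assumes a constant per-region cost inferred from the observed timeout.
    """
    per_region_s = timeout_s / done
    return per_region_s * regions


if __name__ == "__main__":
    per_region = TIMEOUT_S / REGIONS_AT_TIMEOUT
    print(f"per-region cost: {per_region:.2f} s")
    print(f"69 regions on one server: {estimate_seconds(REGIONS_ON_SLOW_RS):.0f} s")
    # With an even spread, the busiest server would hold ceil(174 / 5) regions.
    balanced = math.ceil(TOTAL_REGIONS / REGION_SERVERS)
    print(f"{balanced} regions if balanced: {estimate_seconds(balanced):.1f} s")
```

Under these assumptions the per-region cost falls inside the reported 1 to 2 second range, the loaded server would need well over 60 seconds for its 69 regions, and even a perfectly balanced layout would finish only just inside a 60 second limit.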
