That did the trick. I set it to 100 and the regions are uniform now. Should I leave it there? What are the side effects of this change?
Thanks,
Brian

On Jul 22, 2014, at 11:28 AM, Ted Yu <[email protected]> wrote:

> Here is a code snippet from StochasticLoadBalancer w.r.t. TableSkewCostFunction:
>
>   private static final String TABLE_SKEW_COST_KEY =
>       "hbase.master.balancer.stochastic.tableSkewCost";
>   private static final float DEFAULT_TABLE_SKEW_COST = 35;
>
>   TableSkewCostFunction(Configuration conf) {
>     super(conf);
>     this.setMultiplier(conf.getFloat(TABLE_SKEW_COST_KEY,
>         DEFAULT_TABLE_SKEW_COST));
>   }
>
> You can try increasing the value for
> "hbase.master.balancer.stochastic.tableSkewCost".
>
> Cheers
>
> On Tue, Jul 22, 2014 at 6:59 AM, Brian Jeltema <[email protected]> wrote:
>
>> I don’t understand the logging output, but I do see a strange pattern.
>> I’ll try to summarize.
>>
>> There are 5 RegionServers, call them rs1 through rs5. There are a total
>> of 174 regions for the table in question, with 69 in rs1. In the log
>> output I see lines (greatly simplified) like the following:
>>
>>   AssignmentManager: Assigning fooTable, … to rs2
>>   AssignmentManager: Assigning fooTable, … to rs3
>>   AssignmentManager: Assigning fooTable, … to rs4
>>   AssignmentManager: Assigning fooTable, … to rs5
>>
>> There are 106 such lines, none logging an assignment to rs1.
>>
>> I also see 105 lines like:
>>
>>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs2
>>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs3
>>   …
>>
>> where src=rs1 in every case, and dest=rs1 never occurs.
>>
>> I don’t see any exceptions or log output that reports a problem.
>>
>> On Jul 22, 2014, at 9:18 AM, Ted Yu <[email protected]> wrote:
>>
>>> The load balancer in 0.98 considers many factors when making balancing
>>> decisions.
>>>
>>> Can you take a look at the master log and look for balancer-related
>>> lines? That would give you some clue.
>>>
>>> Cheers
>>>
>>> On Jul 22, 2014, at 5:03 AM, Brian Jeltema <[email protected]> wrote:
>>>
>>>> I ran the balancer from the hbase shell, but don’t see any change. Is
>>>> there a way to balance a specific table?
>>>>
>>>>> bq. One RegionServer has 69 regions
>>>>>
>>>>> Can you run the load balancer so that your regions are better balanced?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Mon, Jul 21, 2014 at 6:56 AM, Brian Jeltema <[email protected]> wrote:
>>>>>
>>>>>> There are 174 regions, not well balanced. One RegionServer has 69
>>>>>> regions. That RegionServer generates a series of log entries (modified
>>>>>> and shown below), one for each region, at roughly 1- to 2-second
>>>>>> intervals. The timeout period expires when it reaches region 36.
>>>>>>
>>>>>>   2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for hfiles
>>>>>>   2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references
>>>>>>   for [hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2] hfiles
>>>>>>   2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file
>>>>>>   (1/1) : hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
>>>>>>   2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush
>>>>>>   Snapshotting region hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6. completed.
>>>>>>   2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region
>>>>>>   operation on hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
>>>>>>   2014-07-21 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1]
>>>>>>   snapshot.FlushSnapshotSubprocedure: Starting region operation on
>>>>>>   hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
>>>>>>   2014-07-21 07:49:45,137 DEBUG [member: 'xxx.digitalenvoy.net,60020,1405943192177'
>>>>>>   subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager:
>>>>>>   Completed 1/174 local region snapshots.
>>>>>>   2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush
>>>>>>   Snapshotting region hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729. started...
>>>>>>   2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for snapshot.
>>>>>>
>>>>>> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>>>
>>>>>>> Can you also tell us more about your table? How many regions on how
>>>>>>> many region servers?
>>>>>>>
>>>>>>> 2014-07-21 8:23 GMT-04:00 Ted Yu <[email protected]>:
>>>>>>>
>>>>>>>> Normally such a timeout is caused by one region server that is slow
>>>>>>>> in completing its part of the snapshot procedure.
>>>>>>>>
>>>>>>>> Have you looked at the region server logs?
>>>>>>>> Feel free to pastebin the relevant portion.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s
>>>>>>>>> timing out after 60 seconds. I increased the value of
>>>>>>>>> hbase.snapshot.master.timeoutMillis and restarted HBase, but the
>>>>>>>>> timeout still happens after 60 seconds. Any suggestions?
>>>>>>>>>
>>>>>>>>> Brian
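[Note: both properties discussed in this thread go in hbase-site.xml on the HMaster host, and the HMaster must be restarted for them to take effect. A sketch, assuming a stock HBase 0.98 install; the values shown are the ones mentioned in the thread (100 for the skew cost) plus an illustrative 5-minute snapshot timeout, not tuned recommendations:]

```
<!-- hbase-site.xml on the HMaster; restart the HMaster after editing. -->
<property>
  <!-- Weight of the table-skew term in the StochasticLoadBalancer cost
       function; the compiled-in default is 35. -->
  <name>hbase.master.balancer.stochastic.tableSkewCost</name>
  <value>100</value>
</property>
<property>
  <!-- Snapshot timeout in milliseconds; 300000 here is an illustrative
       value, not a recommendation. -->
  <name>hbase.snapshot.master.timeoutMillis</name>
  <value>300000</value>
</property>
```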

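[Note: the multiplier lookup in Ted's snippet can be sketched as a small self-contained class. This is a hypothetical stand-in, not the HBase source — a plain Map replaces org.apache.hadoop.conf.Configuration — but the resolution rule is the same: read the key, fall back to the compiled-in default when it is unset.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of how TableSkewCostFunction resolves its multiplier.
public class TableSkewCostSketch {
    static final String TABLE_SKEW_COST_KEY =
            "hbase.master.balancer.stochastic.tableSkewCost";
    static final float DEFAULT_TABLE_SKEW_COST = 35f;

    // Stand-in for conf.getFloat(TABLE_SKEW_COST_KEY, DEFAULT_TABLE_SKEW_COST).
    static float getMultiplier(Map<String, String> conf) {
        String v = conf.get(TABLE_SKEW_COST_KEY);
        return v == null ? DEFAULT_TABLE_SKEW_COST : Float.parseFloat(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getMultiplier(conf));   // key unset: default 35.0
        conf.put(TABLE_SKEW_COST_KEY, "100");
        System.out.println(getMultiplier(conf));   // key set: 100.0
    }
}
```

Roughly speaking, because the balancer minimizes a weighted sum of several cost terms, raising one multiplier de-emphasizes the remaining terms in relative terms — which is the kind of side effect the question at the top of the thread is asking about.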