You can leave your config value there. Remember to record this change somewhere for future reference - you may want to tune other cost parameters later.
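For the record, a persistent override in hbase-site.xml would look like the following (using the 100 you tested; the master needs a restart to pick it up):

```xml
<property>
  <name>hbase.master.balancer.stochastic.tableSkewCost</name>
  <value>100</value>
</property>
```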
The side-effects of this change partly depend on how you want your cluster balanced. I suggest you go over the CostFunctions in StochasticLoadBalancer so that you know which factors (and their weights) the load balancer considers.

Cheers

On Tue, Jul 22, 2014 at 8:43 AM, Brian Jeltema <[email protected]> wrote:

> That did the trick. I set it to 100 and regions are uniform now. Should I
> leave it there? What are the side-effects of this change?
>
> Thanks.
>
> Brian
>
> On Jul 22, 2014, at 11:28 AM, Ted Yu <[email protected]> wrote:
>
> > Here is a code snippet from StochasticLoadBalancer
> > w.r.t. TableSkewCostFunction:
> >
> >   private static final String TABLE_SKEW_COST_KEY =
> >       "hbase.master.balancer.stochastic.tableSkewCost";
> >
> >   private static final float DEFAULT_TABLE_SKEW_COST = 35;
> >
> >   TableSkewCostFunction(Configuration conf) {
> >     super(conf);
> >     this.setMultiplier(conf.getFloat(TABLE_SKEW_COST_KEY,
> >         DEFAULT_TABLE_SKEW_COST));
> >
> > You can try increasing the value for
> > "hbase.master.balancer.stochastic.tableSkewCost"
> >
> > Cheers
> >
> > On Tue, Jul 22, 2014 at 6:59 AM, Brian Jeltema <
> > [email protected]> wrote:
> >
> >> I don't understand the logging output, but I do see a strange pattern.
> >> I'll try to summarize.
> >>
> >> There are 5 RegionServers, call them rs1 through rs5. There are a total
> >> of 174 regions for the table in question, with 69 in rs1. In the log
> >> output I see lines (greatly simplified) like the following:
> >>
> >>   AssignmentManager: Assigning fooTable, …. to rs2
> >>   AssignmentManager: Assigning fooTable, …. to rs3
> >>   AssignmentManager: Assigning fooTable, …. to rs4
> >>   AssignmentManager: Assigning fooTable, ….
> >>   to rs5
> >>
> >> There are 106 such lines, none logging an assignment to rs1.
> >>
> >> I also see 105 lines like:
> >>
> >>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs2
> >>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs3
> >>   …
> >>
> >> where src=rs1 in every case, and dest=rs1 never occurs.
> >>
> >> I don't see any exceptions or log output that reports a problem.
> >>
> >> On Jul 22, 2014, at 9:18 AM, Ted Yu <[email protected]> wrote:
> >>
> >>> The load balancer in 0.98 considers many factors when making balancing
> >>> decisions.
> >>>
> >>> Can you take a look at the master log and look for balancer-related
> >>> lines? That would give you some clue.
> >>>
> >>> Cheers
> >>>
> >>> On Jul 22, 2014, at 5:03 AM, Brian Jeltema <
> >>> [email protected]> wrote:
> >>>
> >>>> I ran the balancer from the hbase shell, but don't see any change. Is
> >>>> there a way to balance a specific table?
> >>>>
> >>>>> bq. One RegionServer has 69 regions
> >>>>>
> >>>>> Can you run the load balancer so that your regions are better balanced?
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Mon, Jul 21, 2014 at 6:56 AM, Brian Jeltema <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> There are 174 regions, not well balanced. One RegionServer has 69
> >>>>>> regions. That RegionServer generates a series of log entries
> >>>>>> (modified and shown below), one for each region, at roughly 1 to 2
> >>>>>> second intervals. The timeout period expires when it reaches region 36.
> >>>>>>
> >>>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for hfiles
> >>>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references for
> >>>>>> [hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2] hfiles
> >>>>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file (1/1) :
> >>>>>> hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
> >>>>>> 2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush Snapshotting region
> >>>>>> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6. completed.
> >>>>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region operation on
> >>>>>> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
> >>>>>> 2014-07-21 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1]
> >>>>>> snapshot.FlushSnapshotSubprocedure: Starting region operation on
> >>>>>> hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
> >>>>>> 2014-07-21 07:49:45,137 DEBUG [member: 'xxx.digitalenvoy.net,60020,1405943192177' subprocedure-pool1-thread-2]
> >>>>>> snapshot.RegionServerSnapshotManager: Completed 1/174 local region snapshots.
> >>>>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region
> >>>>>> hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729. started...
> >>>>>> 2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for snapshot.
> >>>>>>
> >>>>>> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <[email protected]> wrote:
> >>>>>>
> >>>>>>> Can you also tell us more about your table? How many regions on how
> >>>>>>> many region servers?
> >>>>>>>
> >>>>>>> 2014-07-21 8:23 GMT-04:00 Ted Yu <[email protected]>:
> >>>>>>>
> >>>>>>>> Normally such a timeout is caused by one region server which is slow
> >>>>>>>> in completing its part of the snapshot procedure.
> >>>>>>>>
> >>>>>>>> Have you looked at the region server logs?
> >>>>>>>> Feel free to pastebin the relevant portion.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> I'm running HBase 0.98. I'm trying to snapshot a table, but it's
> >>>>>>>>> timing out after 60 seconds. I increased the value of
> >>>>>>>>> hbase.snapshot.master.timeoutMillis and restarted HBase, but the
> >>>>>>>>> timeout still happens after 60 seconds. Any suggestions?
> >>>>>>>>>
> >>>>>>>>> Brian
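For what it's worth, the way those multipliers interact can be sketched like this. This is a toy illustration, not the actual HBase code: the cost names and the 0.x values are made up, and only the 35 default and your 100 come from the thread. The balancer scores a candidate cluster layout as a weighted sum of normalized costs, so raising one multiplier makes that factor dominate the total.

```java
// Toy sketch of how StochasticLoadBalancer's weighted cost functions
// combine (simplified; not the real HBase source). Each cost function
// returns a value normalized to [0, 1]; the balancer minimizes the
// weighted sum across candidate region moves.
public class BalancerCostSketch {

    // Weighted sum of normalized per-factor costs.
    static double totalCost(double[] costs, double[] multipliers) {
        double sum = 0;
        for (int i = 0; i < costs.length; i++) {
            sum += costs[i] * multipliers[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical normalized costs: [regionCount, tableSkew, locality]
        double[] costs = {0.2, 0.8, 0.1};

        // With the default tableSkewCost multiplier of 35:
        double withDefault = totalCost(costs, new double[]{500, 35, 25});
        // After bumping hbase.master.balancer.stochastic.tableSkewCost to 100:
        double withBumped = totalCost(costs, new double[]{500, 100, 25});

        // Table skew now contributes far more to the total, so moves
        // that reduce skew look much more attractive to the balancer.
        System.out.println(withDefault); // 130.5
        System.out.println(withBumped);  // 182.5
    }
}
```

The flip side is visible in the same arithmetic: the larger the skew multiplier, the less the other factors (region count, locality, etc.) influence the outcome, which is why it's worth reviewing all the CostFunctions before settling on a value.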
