If you did the change, can you share your experience/results?

On Wed, Dec 15, 2010 at 12:04 AM, Jan Lukavský <[email protected]> wrote:
> We can give it a try. Currently we use 512 MiB per region, is there any
> upper bound for this value which is not recommended to cross? Are there any
> side-effects we may expect when we set this value to, say, 1 GiB? I suppose
> at least somewhat longer random gets?
>
> Thanks,
> Jan
>
> On 14.12.2010 18:50, Stack wrote:
>
>> Can you do with fewer regions? 1k plus per server is pushing it, I'd say.
>> Can you up your region sizes, for instance?
>> St.Ack
>>
>> On Mon, Dec 13, 2010 at 8:36 AM, Jan Lukavský <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> we are using HBase 0.20.6 on a cluster of about 25 nodes with about 30k
>>> regions and are experiencing an issue which causes running M/R jobs to
>>> fail. When we restart a single RegionServer, the following happens:
>>> 1) all regions of that RS get reassigned to the remaining (say 24) nodes
>>> 2) when the restarted RegionServer comes up, HMaster closes about 60
>>> regions on all 24 nodes and assigns them back to the restarted node
>>>
>>> Now, step 1) is usually very quick (if we can assign 10 regions per
>>> heartbeat, we get 240 regions per heartbeat on the whole cluster).
>>> Step 2) seems problematic, because first about 1200 regions get
>>> unassigned, and then they are slowly assigned to the single RS (again at
>>> about 10 regions per heartbeat). During this time, clients of Maps
>>> connected to those regions throw RetriesExhaustedException.
>>>
>>> I'm aware that we can limit the number of regions closed per RegionServer
>>> heartbeat with hbase.regions.close.max, but this config option seems a
>>> bit unsatisfactory, because as we increase the size of the cluster we
>>> will get more and more regions unassigned in a single cluster heartbeat
>>> (say we limit it to 1, then we get 24 unassigned regions, but only 10
>>> assigned per heartbeat). This led us to a solution which seems quite
>>> simple. We have introduced a new config option which limits the number
>>> of regions in transition. When regionsInTransition.size() crosses the
>>> boundary, we temporarily stop the load balancer. This seems to resolve
>>> our issue, because no region stays unassigned for a long time and
>>> clients manage to recover within their number of retries.
>>>
>>> My question is: is this a general issue for which a new config option
>>> should be proposed, or am I missing something and we could have resolved
>>> the issue with some other config option tuning?
>>>
>>> Thanks,
>>> Jan
>
> --
>
> Jan Lukavský
> programmer
> Seznam.cz, a.s.
> Radlická 608/2
> 15000, Praha 5
>
> [email protected]
> http://www.seznam.cz
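
[Editorial note] A minimal sketch of the two stock settings discussed above: growing regions from 512 MiB toward 1 GiB, and capping per-heartbeat closes with hbase.regions.close.max. In a real deployment these keys would be set in hbase-site.xml on the cluster; the Java snippet below only illustrates the property names, and the chosen values are examples from the thread, not recommendations.

    import org.apache.hadoop.conf.Configuration;

    public class RegionTuningExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Fewer, larger regions: raise the split threshold to 1 GiB so the
            // cluster carries fewer regions overall (the thread starts at 512 MiB).
            conf.setLong("hbase.hregion.max.filesize", 1024L * 1024L * 1024L);

            // Cap how many regions the master closes on a regionserver per
            // heartbeat -- the option Jan calls only a partial fix, since the cap
            // is per server and total unassignments still grow with cluster size.
            conf.setInt("hbase.regions.close.max", 10);

            System.out.println("hbase.hregion.max.filesize = "
                    + conf.getLong("hbase.hregion.max.filesize", -1L));
            System.out.println("hbase.regions.close.max = "
                    + conf.getInt("hbase.regions.close.max", -1));
        }
    }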

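[Editorial note] The fix Jan describes, pausing the balancer once too many regions are in transition, boils down to a small gate checked by the master before it unassigns more regions. The sketch below is a self-contained paraphrase of that idea, not the actual 0.20.6 HMaster code: the class name, the suggested key "hbase.regions.intransition.max", and the map standing in for the master's regions-in-transition bookkeeping are all hypothetical.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class BalancerGate {

        // Hypothetical stand-in for the master's regions-in-transition
        // bookkeeping (region name -> current state).
        private final Map<String, String> regionsInTransition = new ConcurrentHashMap<>();

        // Hypothetical new option: how many regions may be in transition before
        // balancing is paused (e.g. a key like "hbase.regions.intransition.max").
        private final int maxRegionsInTransition;

        public BalancerGate(int maxRegionsInTransition) {
            this.maxRegionsInTransition = maxRegionsInTransition;
        }

        public void regionOpening(String regionName) {
            regionsInTransition.put(regionName, "OPENING");
        }

        public void regionOpened(String regionName) {
            regionsInTransition.remove(regionName);
        }

        /** The master would consult this before unassigning regions for balance. */
        public boolean balancerAllowed() {
            // While many regions are already moving, hold off further unassigns so
            // clients can still reach the regions they need within their retries.
            return regionsInTransition.size() < maxRegionsInTransition;
        }

        public static void main(String[] args) {
            BalancerGate gate = new BalancerGate(2);
            gate.regionOpening("region-a");
            gate.regionOpening("region-b");
            System.out.println("balancer allowed: " + gate.balancerAllowed()); // false
        }
    }

The design point is the one made in the thread: the gate bounds how much of the table is unavailable at any moment, so client retries (and hence M/R tasks) can ride out a RegionServer restart instead of hitting RetriesExhaustedException.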