Thanks J-D. I have filed an issue and attached the logs: https://issues.apache.org/jira/browse/HBASE-4031
Please check whether the logs provide the missing information.

>> What happened to the first master?
We killed the active master so that the standby would take over, because we were testing master failover.

>> How come 1306205940117 went from 5841 regions to 0?
This regionserver hit some exceptions and aborted. It seems there was no master at the time, so ServerShutdownHandler never ran for it.

Jieshan Bean.

---------------------------------------------------------------------------

I feel like I'm missing too much information to be helpful. For example, when the standby master comes up it needs to process 13134 RIT. What happened there? I thought the regions were all assigned?

What happened to the first master?

How come 1306205940117 went from 5841 regions to 0?

Thx for filling the gaps,

J-D

On Thu, Jun 23, 2011 at 6:35 PM, bijieshan <[email protected]> wrote:
> Hi,
>
> I found a problem: the cluster couldn't balance. One node's region count is
> double that of the other nodes, and the balancer no longer moves regions:
>
> Address              Start Code     Load
> 158-1-101-202:20030  1306205409671  requests=0, regions=2593, usedHeap=114, maxHeap=8165
> 158-1-101-222:20030  1306205940117  requests=0, regions=5841, usedHeap=80, maxHeap=8165
> 158-1-101-52:20030   1306205417261  requests=0, regions=2622, usedHeap=76, maxHeap=8165
> 158-1-101-82:20030   1306205415714  requests=0, regions=2633, usedHeap=69, maxHeap=8165
> Total: servers: 4    requests=0, regions=13689
>
> I found HBASE-3985 ("Same Region could be picked out twice in LoadBalancer")
> while analyzing this problem, but I'm afraid it's not the main cause.
>
> There's one active master, one standby master, and four regionservers in our
> cluster.
>
> >>>10:57:41, the standby HMaster 222 becomes the active one.
> 2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover
>
> >>>The 4 regionservers registered with 222 one by one. Only one regionserver
> >>>registered noticeably late.
> 2011-05-24 10:57:37,533 INFO : Registering server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
> 2011-05-24 10:57:37,537 INFO : Registering server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
> 2011-05-24 10:57:37,598 INFO : Registering server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
> 2011-05-24 10:59:00,408 INFO : Registering server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false
>
> >>>13134 regions were in transition after rebuildUserRegions (13689 regions in
> >>>the cluster at the time).
> 2011-05-24 10:58:47,534 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 13134 regions in transition
>
> >>>All 13134 regions were opened. Regions opened on each server:
> 158-1-101-222,20020,1306205940117 Count: 834
> 158-1-101-82,20020,1306205415714 Count: 4093
> 158-1-101-202,20020,1306205409671 Count: 4118
> 158-1-101-52,20020,1306205417261 Count: 4089
>
> >>>The most recent balancer calculation result:
> 2011-05-24 11:12:11,076 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 19ms. Moving 5012 regions off of 3 overloaded servers onto 1 less loaded servers
>
> "5012" is an implausible number here, since it is larger than the average
> region count per server, "3424.5".
>
> Jieshan Bean
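
For reference, the size of the "5012" discrepancy can be checked with a rough calculation. The sketch below is not HBase's LoadBalancer; it is a hypothetical helper (`regions_to_move` is an invented name) that estimates the minimum number of moves needed to bring every server to within one region of the average. It assumes the per-server "regions opened" counts quoted above were the actual loads when the balancer ran, which the thread does not confirm.

```python
import math

def regions_to_move(loads):
    """Estimate the minimum moves needed so that every server ends up
    holding between floor(avg) and ceil(avg) regions.

    This is an illustrative lower bound, not the HBase algorithm.
    """
    total = sum(loads.values())
    avg = total / len(loads)
    ceiling = math.ceil(avg)
    floor = math.floor(avg)
    # Regions that overloaded servers must shed to reach ceil(avg).
    shed = sum(max(0, count - ceiling) for count in loads.values())
    # Regions that underloaded servers must gain to reach floor(avg).
    need = sum(max(0, floor - count) for count in loads.values())
    return avg, max(shed, need)

# Assumption: loads equal the "regions opened" counts from the thread.
opened = {
    "158-1-101-222": 834,
    "158-1-101-82": 4093,
    "158-1-101-202": 4118,
    "158-1-101-52": 4089,
}

avg, moves = regions_to_move(opened)
print(f"average per server: {avg}, estimated moves: {moves}")
```

Under that assumption the estimate comes to 2449 moves, roughly half of the 5012 reported in the log. That would be consistent with regions being counted twice, as described in HBASE-3985, though it does not establish that as the cause.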
