Vladimir, thanks a lot for helping us out. I checked the number of region servers in the master console, and it was more than what we had allotted.
Then I went to the list of FAILED_CLOSE regions, copied the server names, and issued a delete against those nodes in ZK. I restarted the masters (I don't think I needed to do this step) and now all regions show as fine.

Happy now!

________________________________
From: Vladimir Rodionov <[email protected]>
Sent: Tuesday, May 23, 2017 2:41:30 PM
To: [email protected]
Subject: Re: Regions in Transition: FAILED_CLOSE status

My bad, that is FAIL_CLOSE

Anyway, start with the Master log, find the name of a region in FAIL_CLOSE, then check the log of the RS that hosts this region.

On Tue, May 23, 2017 at 2:35 PM, James Moore <[email protected]> wrote:

> How many region servers are dead? And were they colocated with DataNodes?
>
> On Tue, May 23, 2017 at 5:20 PM, Vladimir Rodionov <[email protected]> wrote:
>
> > When the Master attempts to assign a region to an RS and the assignment fails,
> > there should be something in the RS log file (check errors)
> > that explains the reason for the failure.
> >
> > How many unassigned regions do you have? You can try to assign them
> > manually in hbase shell
> >
> > On Tue, May 23, 2017 at 1:25 PM, jeff saremi <[email protected]> wrote:
> >
> > > Are dead region servers to blame? Is this possibly stale information in
> > > the ZK?
> > >
> > > ________________________________
> > > From: Vladimir Rodionov <[email protected]>
> > > Sent: Tuesday, May 23, 2017 12:20:16 PM
> > > To: [email protected]
> > > Subject: Re: Regions in Transition: FAILED_CLOSE status
> > >
> > > You should check RS logs to see why regions can not be assigned.
> > > Get RS name from master log and check RS log
> > >
> > > -Vlad
> > >
> > > On Tue, May 23, 2017 at 11:47 AM, jeff saremi <[email protected]> wrote:
> > >
> > > > Our write code throws exceptions like the following:
> > > >
> > > > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 10331 actions: NotServingRegionException: 10331 times,
> > > > at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:258)
> > > > at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$2000(AsyncProcess.java:238)
> > > > at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1817)
> > > > at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:240)
> > > > at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:146)
> > > > at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1028)
> > > > at com.microsoft.bing.malta.hbaseClient11$$anon$2.run(ImageFeaturesHdfsToHbaseInjector.scala:115)
> > > > at java.lang.Thread.run(Thread.java:745)
> > > >
> > > > ________________________________
> > > > From: jeff saremi <[email protected]>
> > > > Sent: Tuesday, May 23, 2017 11:36:11 AM
> > > > To: [email protected]
> > > > Subject: Regions in Transition: FAILED_CLOSE status
> > > >
> > > > Why are a few hundred of our regions in this state? And what can we do
> > > > to fix this?
> > > > I have run hbck a few times (is running it once enough?) to no
> > > > avail.
> > > >
> > > > An Internet search does not come up with anything useful either.
> > > >
> > > > I have restarted all masters and all region servers with no luck.
> > > >
> > > > Jeff
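
For anyone hitting the same client-side error, here is a minimal sketch for tallying which exceptions a RetriesExhaustedWithDetailsException summary reports, which may help confirm that every failed action was a NotServingRegionException. It assumes the message has the "Failed N actions: Name: K times," shape seen in the trace above; the function name summarize_failures is hypothetical and not part of HBase.

```python
import re

def summarize_failures(message):
    """Return (total_failed, {exception_name: count}) or None if no summary is found.

    Assumes the 'Failed N actions: Name: K times, ...' shape from the trace
    quoted in this thread; other HBase versions may format it differently.
    """
    head = re.search(r"Failed (\d+) actions?:", message)
    if head is None:
        return None
    total = int(head.group(1))
    counts = {}
    # Every 'SomeException: K times' pair after the header is one tally entry.
    for name, n in re.findall(r"(\w+): (\d+) times?", message[head.end():]):
        counts[name] = counts.get(name, 0) + int(n)
    return total, counts

line = ("org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: "
        "Failed 10331 actions: NotServingRegionException: 10331 times,")
print(summarize_failures(line))
```

If the per-exception counts add up to the total and all of them are NotServingRegionException, the write failures are probably a symptom of the unassigned regions rather than a separate client problem.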
